

Speaker- and Age-invariant Training for Child Acoustic Modelling Using Adversarial Multi-task Learning

Mostafa Shahin, Julien Epps, and Beena Ahmed

School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia

ABSTRACT

One of the major challenges in acoustic modelling of child speech is the rapid changes that occur in children’s articulators as they grow up, their differing growth rates, and the resulting high variability within the same age group. These high acoustic variations, along with the scarcity of child speech corpora, have impeded the development of a reliable speech recognition system for children. In this paper, a speaker- and age-invariant training approach based on adversarial multi-task learning is proposed. The system consists of one shared generator network that learns to generate speaker- and age-invariant features, connected to three discrimination networks for phoneme, age, and speaker. The generator network is trained to minimize the phoneme-discrimination loss and maximize the speaker- and age-discrimination losses in an adversarial multi-task learning fashion. The generator network is a Time Delay Neural Network (TDNN) architecture, while the three discriminators are feed-forward networks. The system was applied to the OGI speech corpora and achieved a 13

Index Terms: adversarial multitask learning, child speech recognition, age-invariant, speaker-invariant

1. INTRODUCTION

Despite the enormous improvement in the acoustic modelling of adult speech over the last few decades, less progress has been made on the acoustic modelling of child speech.

Automatic speech recognition systems trained on adult speech have shown a dramatic degradation in performance when tested on child speech due to linguistic and acoustic mismatches between adult and child speech [1]. Children have higher fundamental and formant frequencies due to their smaller vocal cords and shorter vocal tract [2]. Furthermore, the shape of the vocal tract changes rapidly as children grow up and their ability to correctly pronounce speech sounds improves. This leads to wider intra- and inter-speaker variations compared with adult speech [3, 4].

Several approaches were initially proposed to handle variations in child speech at the feature level, such as Vocal Tract Length Normalization (VTLN) [1, 5, 6, 7], Stochastic Feature Mapping (SFM) [8], and Pitch Adaptive Mel-Frequency Cepstral Coefficient (PAMFCC) [9]. Recently, deep learning techniques have become state-of-the-art in speech recognition systems; however, training such models requires a large amount of speech data, which is not available for child speech. Therefore, different domain adaptation and data augmentation techniques have been explored that incorporate both adult and child speech corpora, such as teacher-student domain adaptation [10], transfer learning [11, 12], and Multi-Task Learning (MTL) [13].

Several works have studied the effect of age on the performance of child speech recognition. As expected, performance degraded as age decreased, whether an acoustic model trained on adult speech [5] or an age-specific acoustic model [14] was used. Due to the limited availability of child speech corpora, it is hard to train an accurate acoustic model for each age range. In [1], acoustic model adaptation was used to adapt an adult acoustic model to the different age ranges. An age-dependent speaker normalisation technique using subglottal resonances was proposed in [15].

Adversarial multi-task learning [16] has been used in the literature for speaker-invariant training of adult speech [17, 18]. However, acoustic variations in child speech are caused by both speaker and age variations due to the rapid, non-uniform growth of children’s articulators. In this paper, we therefore investigate whether an adversarial multi-task learning approach can alleviate the effect of both speaker and age acoustic variations in child speech. To achieve this, we propose a system that uses two adversarial tasks, one for age and one for speaker discrimination, to generate speaker- and age-invariant features. Unlike most existing adversarial multi-task learning architectures, where only one adversarial task is learnt jointly with the main task, the proposed architecture uses two adversarial tasks that are trained simultaneously with the main phonetic discrimination task. Moreover, this is the first work to address both age and speaker variations in child speech using adversarial training. The system is validated using a child speech corpus with a large number of speakers distributed over 11 age groups.

2. METHOD

Since Goodfellow et al. presented the Generative Adversarial Network (GAN) as a novel method to generate samples from a target distribution [19], a variety of adversarial learning techniques have been proposed in the literature, including adversarial multi-task learning [16]. In traditional multi-task learning, a second, different but related task is trained while sharing part of the network with the primary task, in order to improve the generalization of the primary task [20]. In contrast, in adversarial multi-task learning, the shared network is learnt adversarially to the secondary task, i.e., it is trained not to discriminate between classes in the secondary task, resulting in representations invariant to the secondary task. Adversarial training has been successfully utilized to improve the robustness of speech recognition systems against noisy environments [21], speaker variations [17, 18], and accent variation [22]. Here we leverage adversarial multi-task learning to handle the high speaker and age variations in child speech.
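The adversarial objective described in this paragraph can be written compactly as a saddle-point problem. The formulation below is a sketch in the usual style of adversarial multi-task learning, not an equation taken from this excerpt. The symbols are assumptions introduced for illustration: $\theta_G$ for the shared network parameters, $\theta_P$ for the primary (phoneme) task, $\theta_S$ and $\theta_A$ for the speaker and age secondary tasks, $\mathcal{L}_P$, $\mathcal{L}_S$, $\mathcal{L}_A$ for the corresponding classification losses, and $\lambda_S, \lambda_A > 0$ for trade-off weights.

```latex
\begin{align}
\mathcal{L}_{\mathrm{total}}(\theta_G,\theta_P,\theta_S,\theta_A)
  &= \mathcal{L}_P(\theta_G,\theta_P)
   - \lambda_S\,\mathcal{L}_S(\theta_G,\theta_S)
   - \lambda_A\,\mathcal{L}_A(\theta_G,\theta_A), \\
(\hat{\theta}_G,\hat{\theta}_P)
  &= \arg\min_{\theta_G,\theta_P}
     \mathcal{L}_{\mathrm{total}}(\theta_G,\theta_P,\hat{\theta}_S,\hat{\theta}_A), \\
(\hat{\theta}_S,\hat{\theta}_A)
  &= \arg\max_{\theta_S,\theta_A}
     \mathcal{L}_{\mathrm{total}}(\hat{\theta}_G,\hat{\theta}_P,\theta_S,\theta_A).
\end{align}
```

Because $\lambda_S$ and $\lambda_A$ are positive, maximizing $\mathcal{L}_{\mathrm{total}}$ over $\theta_S$ and $\theta_A$ is the same as minimizing $\mathcal{L}_S$ and $\mathcal{L}_A$: the secondary classifiers still try to identify speaker and age, while the shared network is pushed towards features on which they fail.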

Figure 1 depicts our proposed architecture of the adversarial multi-task learning network. The architecture consists of four sub-networks: the generative network (G), which is the core network and acts as the feature extraction network; the phoneme recognition network (P), which is trained to classify senones; the speaker discrimination network (S), which is trained to discriminate between speech from different speakers; and the age group discrimination network (A), which is trained to discriminate between different age groups.
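To make the interaction between the four sub-networks concrete, the following is a deliberately tiny sketch of one training step, assuming a gradient-reversal style update. It is not the authors' implementation: G is reduced to a single scalar weight rather than a TDNN, and P, S and A are reduced to binary logistic heads rather than multi-class feed-forward networks. The function names, the weights `lam_s` and `lam_a`, and the learning rate are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for a single example."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def head_forward_backward(h, v, y):
    """Toy logistic head on a scalar feature h with weight v.

    Returns (loss, dloss/dv, dloss/dh).
    """
    p = sigmoid(v * h)
    err = p - y  # derivative of the cross-entropy w.r.t. the logit
    return bce(p, y), err * h, err * v


def train_step(params, x, y_phone, y_spk, y_age,
               lam_s=1.0, lam_a=1.0, lr=0.1):
    """One SGD step for the toy G/P/S/A system (all names hypothetical)."""
    h = params["w"] * x  # G: shared feature extractor (a scalar here)

    loss_p, g_vp, g_h_p = head_forward_backward(h, params["v_p"], y_phone)
    loss_s, g_vs, g_h_s = head_forward_backward(h, params["v_s"], y_spk)
    loss_a, g_va, g_h_a = head_forward_backward(h, params["v_a"], y_age)

    # Gradient reversal: G descends on the phoneme loss but ascends on the
    # speaker and age losses, so their gradients enter with a minus sign.
    g_h = g_h_p - lam_s * g_h_s - lam_a * g_h_a
    g_w = g_h * x

    new_params = {
        "w": params["w"] - lr * g_w,
        # Each head minimizes its own loss with an ordinary gradient step.
        "v_p": params["v_p"] - lr * g_vp,
        "v_s": params["v_s"] - lr * g_vs,
        "v_a": params["v_a"] - lr * g_va,
    }
    return new_params, (loss_p, loss_s, loss_a)


if __name__ == "__main__":
    params = {"w": 1.0, "v_p": 1.0, "v_s": 1.0, "v_a": 1.0}
    params, losses = train_step(params, x=1.0, y_phone=1, y_spk=1, y_age=1)
    print(params, losses)
```

Setting `lam_s` and `lam_a` to zero recovers ordinary single-task training of G and P, which is a convenient baseline when checking that the reversed gradients actually change the shared weights.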

arXiv:2210.10231v2 [cs.SD] 7 Nov 2022
