Speaker- and Age-invariant Training for Child Acoustic Modelling Using Adversarial Multi-task
Learning
Mostafa Shahin, Julien Epps, and Beena Ahmed
School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia
ABSTRACT
One of the major challenges in acoustic modelling of child speech
is the rapid change in children’s articulators as they grow, together
with their differing growth rates and the resulting high variability
within the same age group. These large acoustic variations, along
with the scarcity of child speech corpora, have impeded the development
of a reliable speech recognition system for children. In this
paper, a speaker- and age-invariant training approach based on
adversarial multi-task learning is proposed. The system consists of one
shared generator network that learns to generate speaker- and
age-invariant features, connected to three discrimination networks for
phoneme, age, and speaker. The generator network is trained to
minimize the phoneme-discrimination loss and maximize the speaker-
and age-discrimination losses in an adversarial multi-task learning
fashion. The generator network is a Time Delay Neural Network
(TDNN) architecture while the three discriminators are feed-forward
networks. The system was applied to the OGI speech corpora and
achieved a 13
Index Terms—adversarial multitask learning, child speech
recognition, age-invariant, speaker-invariant
1. INTRODUCTION
Despite the enormous improvement in the acoustic modelling of
adult speech over the last few decades, less progress has been made
on the acoustic modelling of child speech.
Automatic speech recognition systems trained on adult speech
have shown a dramatic degradation in performance when tested on
child speech due to linguistic and acoustic mismatches between adult
and child speech [1]. Children have higher fundamental and for-
mant frequencies due to their smaller vocal cords and shorter vocal
tract [2]. Furthermore, the shape of the vocal tract changes rapidly
as children grow up and their ability to correctly pronounce speech
sounds improves. This leads to wider intra- and inter-speaker varia-
tions compared with adult speech [3, 4].
Several approaches were initially proposed to handle variations
in child speech at the feature level such as Vocal Tract Length Nor-
malization (VTLN) [1, 5, 6, 7], Stochastic Feature Mapping (SFM)
[8], and Pitch Adaptive Mel-Frequency Cepstral Coefficient (PAMFCC)
[9]. Recently, deep learning techniques have become state-of-the-art
in speech recognition systems; however, training such models
requires a large amount of speech data, which is not available
for child speech. Therefore, different domain adaptation and
data augmentation techniques have been explored that incorporate
both adult and child speech corpora such as teacher-student domain
adaptation [10], transfer learning [11, 12], and Multi-Task Learning
(MTL) [13].
Several works have studied the effect of age on the performance
of child speech recognition. As expected, performance degraded
as age decreased when either an acoustic model trained on
adult speech [5] or an age-specific acoustic model [14] was used. Due
to the limited availability of child speech corpora, it is hard to train
an accurate acoustic model for each age range. In [1] acoustic model
adaptation was used to adapt an adult acoustic model to the different
age ranges. An age-dependent speaker normalisation technique was
proposed in [15] using subglottal resonances.
Adversarial multi-task learning [16] has been used in the literature
for speaker-invariant training of adult speech [17, 18]. However,
acoustic variations in child speech are caused by both speaker and
age variations due to the rapid, non-uniform growth of their
articulators. In this paper, we therefore investigate whether an adversarial
multi-task learning approach can be used to alleviate the effect of
both speaker and age acoustic variations in child speech. To achieve
this, we propose a system that uses two adversarial tasks, one for
age and one for speaker discrimination, to generate speaker- and
age-invariant features. Unlike most existing adversarial multi-task
learning architectures, where only one adversarial task is learnt jointly
with the main task, the proposed architecture uses two adversarial
tasks that are trained simultaneously with the main phonetic
discrimination task. Moreover, this is the first work to address both
age and speaker variations in child speech using adversarial
training. The system is validated using a child speech corpus with a large
number of speakers distributed over 11 age groups.
2. METHOD
Since Goodfellow et al. presented the Generative Adversarial Network
(GAN) as a novel method to generate samples from a target
distribution [19], a variety of adversarial learning techniques have been
proposed in the literature, including adversarial multi-task learning [16].
In traditional multi-task learning, a second, different but related
task is trained jointly with the primary task, sharing part of the network,
to improve the generalization of the primary task [20]. In contrast,
in adversarial multi-task learning, the shared network is trained
adversarially against the secondary task, i.e., so that its output does
not discriminate between the classes of the secondary task, resulting in
representations that are invariant to it. Adversarial training has been successfully
utilized to improve the robustness of speech recognition systems
against noisy environments [21], speaker variations [17, 18], and
accent variation [22]. Here we leverage adversarial multi-task learning
to handle the high speaker and age variability in child speech.
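
The opposing objectives described above can be illustrated with a small numerical sketch. It only shows the sign convention implied by the text: the shared network lowers the phoneme loss while raising the speaker and age losses. The weights `lambda_speaker` and `lambda_age`, the function names, and the toy posteriors are assumptions made for illustration; the exact loss weighting and optimisation procedure are not given in this part of the paper.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class for one frame."""
    return -math.log(probs[target])

def generator_objective(loss_phoneme, loss_speaker, loss_age,
                        lambda_speaker=1.0, lambda_age=1.0):
    """Value the shared (generator) network would minimise.

    The lambdas are hypothetical trade-off weights, not values from the paper.
    Minimising this value lowers the phoneme loss while raising
    (i.e. maximising) the speaker and age losses.
    """
    return loss_phoneme - lambda_speaker * loss_speaker - lambda_age * loss_age

# Toy posteriors produced by the three discriminators for a single frame.
l_p = cross_entropy([0.7, 0.2, 0.1], target=0)    # phoneme (senone) head
l_s = cross_entropy([0.5, 0.5], target=1)         # speaker head
l_a = cross_entropy([0.25, 0.25, 0.5], target=2)  # age-group head

objective = generator_objective(l_p, l_s, l_a)
print(round(l_p, 4), round(l_s, 4), round(l_a, 4), round(objective, 4))
```

In this sketch, each discriminator would still be trained to minimise its own cross-entropy; only the shared network sees the speaker and age terms with a negative sign.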
Figure 1 depicts the proposed adversarial multi-task learning
network. The architecture consists of four sub-networks: the generative
network (G), which is the core network and acts as the feature
extraction network; the phoneme recognition network (P), which is
trained to classify senones; the speaker discrimination network (S),
which is trained to discriminate between speech from different
speakers; and the age group discrimination network (A), which is
trained to discriminate between different age groups.
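
The wiring of the four sub-networks can be sketched in a few lines of plain Python. This is a toy forward pass under stated assumptions, not the authors' implementation: G is reduced to a single TDNN-style layer that splices neighbouring frames, the weights are random, and all sizes (`FEAT_DIM`, `HIDDEN`, `N_SENONES`, `N_SPEAKERS`, `OFFSETS`) are hypothetical. Only the 11 age groups come from the text.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

def linear(n_in, n_out):
    """Random n_out x n_in weight matrix standing in for a trained layer."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_out)]

def dense(weights, x, relu=True):
    """Matrix-vector product, optionally followed by a ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out

def splice(frames, t, offsets):
    """Concatenate the frames at t + offset, clamping at utterance edges."""
    last = len(frames) - 1
    context = []
    for off in offsets:
        context.extend(frames[min(max(t + off, 0), last)])
    return context

# Hypothetical sizes, except the 11 age groups mentioned in the text.
FEAT_DIM, HIDDEN = 4, 8
N_SENONES, N_SPEAKERS, N_AGE_GROUPS = 6, 3, 11
OFFSETS = (-1, 0, 1)  # assumed temporal context of the TDNN-style layer

G = linear(FEAT_DIM * len(OFFSETS), HIDDEN)  # shared generator (feature extractor)
P = linear(HIDDEN, N_SENONES)                # phoneme (senone) head
S = linear(HIDDEN, N_SPEAKERS)               # speaker head
A = linear(HIDDEN, N_AGE_GROUPS)             # age-group head

def forward(frames):
    """Return per-frame logits from the three heads on top of G."""
    outputs = []
    for t in range(len(frames)):
        shared = dense(G, splice(frames, t, OFFSETS))
        outputs.append({
            "phoneme": dense(P, shared, relu=False),
            "speaker": dense(S, shared, relu=False),
            "age": dense(A, shared, relu=False),
        })
    return outputs

utterance = [[random.random() for _ in range(FEAT_DIM)] for _ in range(5)]
logits = forward(utterance)
print(len(logits), len(logits[0]["phoneme"]), len(logits[0]["age"]))
```

All three heads read the same shared representation, which is what allows the adversarial terms for S and A to shape the features consumed by P.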
arXiv:2210.10231v2 [cs.SD] 7 Nov 2022