
SVLDL: IMPROVED SPEAKER AGE ESTIMATION USING SELECTIVE VARIANCE LABEL
DISTRIBUTION LEARNING
Zuheng Kang, Jianzong Wang*, Junqing Peng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
ABSTRACT
Estimating age from a single speech utterance is a classic and challenging
topic. Although Label Distribution Learning (LDL)
can represent adjacent indistinguishable ages well, the uncer-
tainty of the age estimate for each utterance varies from person
to person, i.e., the variance of the age distribution differs.
To address this issue, we propose the selective variance label
distribution learning (SVLDL) method to adapt the variance
of different age distributions. Furthermore, the model uses
WavLM as the speech feature extractor and adds the auxiliary
task of gender recognition to further improve the performance.
Two tricks are applied on the loss function to enhance the
robustness of the age estimation and improve the quality of the
fitted age distribution. Extensive experiments show that the
model achieves state-of-the-art performance on all aspects of
the NIST SRE08-10 dataset and a real-world dataset.
Index Terms—
speaker age estimation, label distribution
learning, multi-task learning, gender recognition
1. INTRODUCTION
Speech is the sound produced by the precisely coordinated
movement of multiple organs in the human body. Hence, the
acoustic characteristics of speech can transmit information
about the physical characteristics of the speaker. The rapid
development of new speech applications requires techniques
capable of estimating information on various biological attributes
of speakers. Recently, deep-learning-based approaches
have shown strong performance in extracting hidden speech
information, including facial expression [1], emotion [2],
and age [3]. If such speech features can be used to automatically
estimate a speaker’s age, the technique could be widely applied to
human-computer interaction, forensics, and other purposes.
Many researchers have studied the performance of humans
and artificial intelligence systems in estimating age from
speech. The results show that the average error of humans judging
the age of adults is about 10 years, and that of judging
the age of children is about 1 year [4]. The performance of
age estimates may also have implications for human development.
The authors of [5] collected the speech of children. It can be seen that,
as children gradually enter puberty, changes in the vocal cords
can affect age estimates and increase uncertainty. In adulthood,
the vocal cords are fully developed and change only slowly.
However, as people age, various organs undergo normal
aging: the voice changes from bright to hoarse, and articulation
from clear to vague [6, 7]. Judgments at different ages
therefore carry different uncertainties, and these uncertainties may
vary from age to age and from utterance to utterance.
*Corresponding author: Jianzong Wang, jzwang@188.com
Traditional methods for speaker age estimation can be gen-
erally classified into classification-based and regression-based
methods. Most researchers mainly focus on exploring
backbone model structures, such as deep neural networks
(DNN) [8], i-vector [9], and x-vector [10, 11], or on adding an
attention mechanism [12]. Some researchers have tried different
machine learning features, such as those from the openSMILE toolbox
[13], to study this problem [14, 15]. As hand-crafted acoustic
features, such as mel-filter banks, encounter performance
bottlenecks, some researchers model other speech features
that can capture acoustic cues imperceptible to the human ear.
For example, SincNet [16] takes full advantage of the acoustic
information, resulting in improved performance.
However, these features are only direct translations
of speech signals, not language models for understanding human
speech. Self-supervised learning (SSL) generates high-quality
speech features with language models (such as wav2vec [17]
and WavLM [18]) by learning from a large amount of data
[19]. By injecting this prior knowledge, speaker age estimation
achieves better performance [20]. Although these methods
have achieved great results, they rarely
consider the relationships between labels, such as order and
adjacent correlations, which are important clues for speaker
age estimation. Since speaker age labels form an ordered set
of numbers, significant ordinal relationships and adjacencies
between labels should be fully exploited to achieve higher
performance.
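The point about ignored label order can be made concrete with a small sketch. It is not taken from the paper: the 0 to 100 age range, the helper names, and the toy predictions are assumptions chosen only for illustration. It compares two predictions for a 30-year-old speaker under a one-hot cross-entropy loss, one peaked at age 31 and one peaked at age 70.

```python
import math

AGES = list(range(0, 101))  # assumed label set: integer ages 0..100


def peaked_prediction(center, peak=0.9):
    """Toy predicted distribution: `peak` mass on `center`, rest spread uniformly."""
    rest = (1.0 - peak) / (len(AGES) - 1)
    return [peak if a == center else rest for a in AGES]


def one_hot_cross_entropy(pred, true_age):
    """Cross-entropy against a one-hot target: only the true-age probability matters."""
    return -math.log(pred[AGES.index(true_age)])


def expected_age(pred):
    """Mean of the predicted distribution, used here as the point estimate."""
    return sum(a * p for a, p in zip(AGES, pred))


true_age = 30
near_miss = peaked_prediction(31)  # peak is 1 year away from the true age
far_miss = peaked_prediction(70)   # peak is 40 years away from the true age

ce_near = one_hot_cross_entropy(near_miss, true_age)
ce_far = one_hot_cross_entropy(far_miss, true_age)
err_near = abs(expected_age(near_miss) - true_age)
err_far = abs(expected_age(far_miss) - true_age)

print(f"cross-entropy: near={ce_near:.3f}, far={ce_far:.3f}")
print(f"absolute age error: near={err_near:.2f}, far={err_far:.2f}")
```

In this toy setting both predictions receive the same cross-entropy, because the loss only looks at the probability assigned to the true age, while the absolute age error of the second prediction is far larger. This is the ordinal information that a plain classification loss leaves unused.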
Label distribution learning (LDL) [21] addresses the above
problems by transforming the classification problem into a
distribution learning task that minimizes the difference between
the predicted and constructed Gaussian distributions of
labels. In the field of computer vision, impressive progress
has been made in facial age estimation, where LDL shows
great potential [22]. The framework in [3] applied this method to
the speaker age recognition task and achieved good performance.
Since the uncertainty of each person is different, i.e.,
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.09524v2 [cs.SD] 16 Nov 2022