SVLDL: IMPROVED SPEAKER AGE ESTIMATION USING SELECTIVE VARIANCE LABEL
DISTRIBUTION LEARNING
Zuheng Kang, Jianzong Wang*, Junqing Peng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
*Corresponding author: Jianzong Wang, jzwang@188.com
ABSTRACT
Estimating age from a single speech utterance is a classic and challenging topic. Although Label Distribution Learning (LDL) can represent adjacent, indistinguishable ages well, the uncertainty of the age estimate for each utterance varies from person to person, i.e., the variance of the age distribution differs. To address this issue, we propose the selective variance label distribution learning (SVLDL) method to adapt the variance of different age distributions. Furthermore, the model uses WavLM as the speech feature extractor and adds the auxiliary task of gender recognition to further improve performance. Two tricks are applied to the loss function to enhance the robustness of the age estimation and improve the quality of the fitted age distribution. Extensive experiments show that the model achieves state-of-the-art performance on all aspects of the NIST SRE08-10 and a real-world dataset.
Index Terms— speaker age estimation, label distribution learning, multi-task learning, gender recognition
1. INTRODUCTION
Speech is the sound produced by the accurately coordinated movement of multiple organs in the human body. Hence, the acoustic characteristics of speech can transmit information about the physical characteristics of the speaker. The rapid development of new speech applications requires techniques capable of estimating various biological attributes of such speakers. Recently, deep-learning-based approaches have shown great performance in extracting hidden speech information, including facial expression [1], emotion [2], and age [3], etc. If such speech features can be used to automatically estimate a speaker's age, they could be widely used for human-computer interaction, forensics, and other purposes.
Many researchers have studied the performance of human and artificial intelligence systems in estimating age from speech. The results show that the average error of humans judging the age of adults is about 10 years, and about 1 year when judging the age of children [4]. The performance of age estimates may also have implications for human development. The study in [5] collected speech from children. It can be seen that, as children gradually enter puberty, changes in the vocal cords can affect age estimates and increase uncertainty. In adulthood, the vocal cords are fully developed and change tends to be slow. However, as we age, various organs experience regular aging: the voice changes from bright to hoarse, and articulation from clear to vague [6, 7]. Judgments at different ages also have different uncertainties, and these uncertainties may vary from age to age and from utterance to utterance.
Traditional methods for speaker age estimation can generally be classified into classification-based and regression-based methods. Most researchers mainly focus on exploring backbone model structures, such as the deep neural network (DNN) [8], i-vector [9], and x-vector [10, 11], or on adding attention mechanisms [12]. Some researchers have tried different machine learning features, such as those produced by the OpenSmile toolbox [13], to study this problem [14, 15]. As hand-crafted acoustic features, such as mel-filter banks, encounter performance bottlenecks, some researchers use other speech features for modeling that can capture acoustic cues imperceptible to the human ear; for example, SincNet [16] takes full advantage of the acoustic information, resulting in improved performance. However, these features are only direct translations of speech signals, not language models for understanding human speech. Self-supervised learning (SSL) generates high-quality speech features with a language model (such as wav2vec [17] and WavLM [18]) by learning from a large amount of data [19]. By injecting this prior knowledge, speech age estimation achieves better performance [20]. Although these methods have achieved great results, they rarely consider the relationships between labels, such as order and adjacency correlations, which are important clues for speaker age estimation. Since speaker age labels form an ordered set of numbers, the significant ordinal relationships and adjacencies between labels should be fully exploited to achieve higher performance.
Label distribution learning (LDL) [21] addresses the above problems by transforming the classification problem into a distribution learning task that minimizes the difference between the predicted and constructed Gaussian distributions of labels. In the field of computer vision, impressive progress has been made in facial age estimation, where LDL shows great potential [22]. The framework of [3] applied this method to the speaker age recognition task and achieved good performance.
[Figure 1: block diagram — speech signal → WavLM transformer layers combined by a trainable weighted sum → ECAPA-TDNN backbone → attentive statistics pooling → FC layers → softmax age probability with minimum-KL "select best match", plus an auxiliary gender (MTL) head; the losses include KL-divergence, difference, variance, CCC, and focal losses.]
Fig. 1. Network topology of the SVLDL framework. "FC" denotes a fully connected layer. ⊕ denotes element-wise addition.
Since the uncertainty of each person is different, i.e., the variance of the Gaussian distribution varies from person to person, adaptive LDL methods have been proposed successively [23, 24, 25]. However, the loss functions that measure regression error often use simple metrics, such as the L1 distance, which are not dynamically adjusted for a specific distribution at training time, so these methods do not achieve optimal regression performance. Meanwhile, these algorithms do not guarantee the correct shape of the learned distribution, which may lead to multimodal problems (multiple peaks in the fitted distribution).
Additionally, multi-task learning (MTL) uses a shared backbone model to simultaneously optimize objectives for different tasks. The advantage comes from adding more useful information while optimizing the original model. In speaker age estimation, adding the task of gender recognition has been shown to improve performance [20, 26]. Meanwhile, in regression problems, Lin's concordance correlation coefficient loss [27] also achieves substantial performance gains by replacing L1 or L2 distance-based losses.
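For reference, Lin's concordance correlation coefficient between predicted ages $\hat{y}$ and true ages $y$ over a batch, written here in our notation, and the loss derived from it are
$$\rho_c = \frac{2\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + \left(\mu_{\hat{y}} - \mu_{y}\right)^{2}}, \qquad \mathcal{L}_{\mathrm{CCC}} = 1 - \rho_c,$$
where $\mu$ and $\sigma^2$ are batch means and variances and $\rho$ is the Pearson correlation coefficient. Unlike an L1 or L2 loss, $\rho_c$ reaches 1 only when the predictions agree with the targets in both location and scale, so minimizing $\mathcal{L}_{\mathrm{CCC}}$ penalizes correlation and calibration errors jointly.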
Considering the above advantages and disadvantages, we make the following improvements and contributions:
• We improve the original label distribution learning (LDL) method and propose a new selective variance label distribution learning (SVLDL) method that adaptively selects the optimal distribution matching the variance.
• The quality of the fitted distributions is improved by additionally fitting the first-order difference distribution, and a brief theoretical proof is given (an illustrative sketch follows this list).
• The age estimation performance is enhanced by using Lin's concordance correlation coefficient (CCC) loss [27].
• The performance is further improved by adding an auxiliary task of gender recognition and using WavLM as the speech feature extractor.
• Experimental results on the publicly available NIST SRE08-10 dataset and a real-world dataset show that the improved SVLDL framework achieves state-of-the-art performance compared to the framework of [3].
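As a loose illustration of the difference-fitting idea above (this is our own reading, not necessarily the exact formulation used in the paper): let $y$ denote the constructed target distribution and $\hat{y}$ the prediction over $K$ age bins; an additional term can penalize mismatched first-order differences,
$$\mathcal{L}_{\mathrm{Diff}}(\hat{y}, y) = \sum_{k=1}^{K-1} \left[ (\hat{y}_{k+1} - \hat{y}_{k}) - (y_{k+1} - y_{k}) \right]^{2},$$
which pushes the slope pattern of $\hat{y}$ toward that of the unimodal target and thereby discourages spurious extra peaks.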
2. METHODOLOGY
2.1. Network Architecture
Figure 1 outlines the pipeline of the proposed method. Since ECAPA-TDNN [28] has an efficient design, with components such as Res2Net [29] and squeeze-and-excitation (SE) blocks [30], it is used as the backbone model. All the information along the time dimension is aggregated through attentive statistics pooling (SP). After the SP, there are two fully connected layers, and finally a softmax layer is connected to obtain the output distribution over the labels, denoted as $\hat{y}$; the output of the middle layer is denoted as $z$, which is also used as input for the auxiliary task of gender recognition.
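To make the data flow concrete, the following is a minimal PyTorch sketch of this backend. It is not the authors' code: the pooled feature size, the 1024-dimensional embedding, the number of age bins, and the absence of extra non-linearities are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SVLDLBackend(nn.Module):
    """Sketch of the backend: a shared embedding z feeds the age head
    (softmax over age bins) and an auxiliary gender head."""

    def __init__(self, pooled_dim=3072, emb_dim=1024, num_age_bins=101, num_genders=2):
        super().__init__()
        # first FC layer after attentive statistics pooling -> shared embedding z
        self.fc1 = nn.Linear(pooled_dim, emb_dim)
        # second FC layer; softmax over its output is the predicted age distribution
        self.fc_age = nn.Linear(emb_dim, num_age_bins)
        # auxiliary gender head consumes the same embedding z (multi-task learning)
        self.fc_gender = nn.Linear(emb_dim, num_genders)

    def forward(self, pooled):
        # pooled: (batch, pooled_dim) statistics-pooled ECAPA-TDNN features
        z = self.fc1(pooled)                               # shared embedding z
        age_dist = torch.softmax(self.fc_age(z), dim=-1)   # predicted age distribution
        gender_logits = self.fc_gender(z)                  # logits for the auxiliary gender task
        return age_dist, gender_logits
```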
2.2. Self-supervised Representation
Motivated by the successful application of self-supervised learning (SSL) in various speech domains, we explore the use of WavLM [18] for the task of speaker age estimation. The WavLM model learns speech representations by solving contrastive tasks in a latent space in a self-supervised manner: it tries to recover the randomly masked parts of the encoded audio features. By learning from large amounts of real multilingual, multi-channel unlabeled data, SSL models can deeply understand contextual information and produce high-quality speech representations in the latent space.
In our framework, as seen in Figure 1, we utilize all latent outputs of the WavLM transformer layers $\Phi = (\phi_1, \dots, \phi_L)$ and assign a trainable weight $W = (w_1, \dots, w_L)$ to each of them. The weighted sum is then used to generate the speech features $x = \sum_{i=1}^{L} \phi_i \cdot w_i$, where $\Phi \in \mathbb{R}^{L \times T \times C_f}$, $x \in \mathbb{R}^{T \times C_f}$, $T$ is the number of time frames, $C_f$ is the feature size, and $L$ is the number of WavLM layers. In this way, the model can make full use of speech information from shallow to deep, from concrete to abstract.
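A minimal PyTorch sketch of this learnable layer combination follows; it implements the plain weighted sum $x = \sum_i \phi_i \cdot w_i$ given above (whether the weights are additionally normalized, e.g. by a softmax, is not stated in this excerpt, so they are left unconstrained; all names are ours).

```python
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Combine the outputs of all L WavLM transformer layers with one
    trainable scalar weight per layer: x = sum_i w_i * phi_i."""

    def __init__(self, num_layers):
        super().__init__()
        # one trainable weight per WavLM layer, initialized uniformly
        self.weights = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, layer_outputs):
        # layer_outputs: (L, batch, T, C_f) stacked hidden states of WavLM
        w = self.weights.view(-1, 1, 1, 1)
        return (w * layer_outputs).sum(dim=0)  # -> (batch, T, C_f)


# usage sketch: hidden_states is a list of L tensors of shape (batch, T, C_f)
# returned by a WavLM model configured to output all hidden states:
# combine = WeightedLayerSum(num_layers=len(hidden_states))
# x = combine(torch.stack(hidden_states, dim=0))
```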
2.3. Label Distribution Learning
Before introducing SVLDL, we need to know how LDL works and understand some parameters: $\hat{\mu}_n$ and $\hat{\sigma}_n$ are the mean and standard deviation of the predicted distribution, and $\mu_n$ and $\sigma_n$ are those of the constructed target label distribution.
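As background, a standard LDL construction consistent with these definitions (stated in our notation and as an assumption about the full setup, which is not reproduced here): the target distribution for the $n$-th utterance with true age $\mu_n$ is a discretized Gaussian over the $K$ age labels,
$$y_{n,k} = \frac{1}{Z_n}\exp\!\left(-\frac{(k-\mu_n)^2}{2\sigma_n^2}\right), \qquad Z_n = \sum_{j=1}^{K}\exp\!\left(-\frac{(j-\mu_n)^2}{2\sigma_n^2}\right),$$
and training minimizes the KL divergence $\mathcal{L}_{\mathrm{KL}}(\hat{y}_n, y_n) = \sum_{k} y_{n,k}\log\frac{y_{n,k}}{\hat{y}_{n,k}}$ between this target and the predicted distribution $\hat{y}_n$. SVLDL's selective-variance idea, as described in the abstract, then adaptively selects among candidate values of $\sigma_n$ the target distribution that best matches the prediction.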