SVLDL: IMPROVED SPEAKER AGE ESTIMATION USING SELECTIVE VARIANCE LABEL
DISTRIBUTION LEARNING
Zuheng Kang, Jianzong Wang*, Junqing Peng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
*Corresponding author: Jianzong Wang, jzwang@188.com
ABSTRACT
Estimating age from a single speech utterance is a classic and challenging topic. Although Label Distribution Learning (LDL) can represent adjacent, indistinguishable ages well, the uncertainty of the age estimate for each utterance varies from person to person, i.e., the variance of the age distribution differs. To address this issue, we propose the selective variance label distribution learning (SVLDL) method to adapt the variance of different age distributions. Furthermore, the model uses WavLM as the speech feature extractor and adds the auxiliary task of gender recognition to further improve performance. Two tricks are applied to the loss function to enhance the robustness of the age estimation and improve the quality of the fitted age distribution. Extensive experiments show that the model achieves state-of-the-art performance on all aspects of the NIST SRE08-10 and a real-world dataset.
Index Terms— speaker age estimation, label distribution learning, multi-task learning, gender recognition
1. INTRODUCTION
Speech is the sound produced by the accurately coordinated movement of multiple organs in the human body. Hence, the acoustic characteristics of speech can transmit information about the physical characteristics of the speaker. The rapid development of new speech applications requires techniques capable of estimating various biological attributes of such speakers. Recently, deep-learning-based approaches have shown great performance in extracting hidden speech information, including facial expression [1], emotion [2], and age [3], etc. If such speech features can be used to automatically estimate a speaker's age, they could be widely used for human-computer interaction, forensics, and other purposes.
Many researchers have studied the performance of human and artificial intelligence systems in estimating age from speech. The results show that the average error of humans judging the age of adults is about 10 years, and about 1 year when judging the age of children [4]. The performance of age estimates may also have implications for human development. The study in [5] collected speech from children. It can be seen that, as children gradually enter puberty, changes in the vocal cords can affect age estimates and increase uncertainty. In adulthood, the vocal cords are fully developed and change tends to be slow. However, as we age, various organs experience regular aging: the voice changes from bright to hoarse, and articulation from clear to vague [6, 7]. Judgments at different ages also have different uncertainties, and these uncertainties may vary from age to age and from utterance to utterance.
Traditional methods for speaker age estimation can generally be classified into classification-based and regression-based methods. Most researchers mainly focus on exploring backbone model structures, such as the deep neural network (DNN) [8], i-vector [9], and x-vector [10, 11], or on adding attention mechanisms [12]. Some researchers have tried different machine learning features, such as those produced by the OpenSmile toolbox [13], to study this problem [14, 15]. As hand-crafted acoustic features, such as mel-filter banks, encounter performance bottlenecks, some researchers use other speech features for modeling that can capture acoustic cues imperceptible to the human ear; for example, SincNet [16] takes full advantage of the acoustic information, resulting in improved performance. However, these features are only direct translations of speech signals, not language models for understanding human speech. Self-supervised learning (SSL) generates high-quality speech features with a language model (such as wav2vec [17] and WavLM [18]) by learning from a large amount of data [19]. By injecting this prior knowledge, speech age estimation achieves better performance [20]. Although these methods have achieved great results, they rarely consider the relationships between labels, such as order and adjacency correlations, which are important clues for speaker age estimation. Since speaker age labels form an ordered set of numbers, the significant ordinal relationships and adjacencies between labels should be fully exploited to achieve higher performance.
Label distribution learning (LDL) [21] addresses the above problems by transforming the classification problem into a distribution learning task that minimizes the difference between the predicted and constructed Gaussian distributions of labels. In the field of computer vision, impressive progress has been made in facial age estimation, where LDL shows great potential [22]. The framework of [3] applied this method to the speaker age recognition task and achieved good performance.
[Figure 1: block diagram — speech signal → WavLM transformer layers combined by a trainable weighted sum → ECAPA-TDNN backbone → attentive statistics pooling → FC layers → softmax age probability with minimum-KL "select best match", plus an auxiliary gender (MTL) head; the losses include KL-divergence, difference, variance, CCC, and focal losses.]
Fig. 1. Network topology of the SVLDL framework. "FC" denotes a fully connected layer. ⊕ denotes element-wise addition.
Since the uncertainty of each person is different, i.e., the variance of the Gaussian distribution varies from person to person, adaptive LDL methods have been proposed successively [23, 24, 25]. However, the loss functions that measure regression error often use simple metrics, such as the L1 distance, which are not dynamically adjusted for a specific distribution at training time, so these methods do not achieve optimal regression performance. Meanwhile, these algorithms do not guarantee the correct shape of the learned distribution, which may lead to multimodal problems (multiple peaks in the fitted distribution).
Additionally, multi-task learning (MTL) uses a shared backbone model to simultaneously optimize objectives for different tasks. The advantage comes from adding more useful information while optimizing the original model. In speaker age estimation, adding the task of gender recognition has been shown to improve performance [20, 26]. Meanwhile, in regression problems, Lin's concordance correlation coefficient loss [27] also achieves substantial performance gains by replacing L1 or L2 distance-based losses.
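For reference, Lin's concordance correlation coefficient between predicted ages $\hat{y}$ and true ages $y$ over a batch, written here in our notation, and the loss derived from it are
$$\rho_c = \frac{2\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + \left(\mu_{\hat{y}} - \mu_{y}\right)^{2}}, \qquad \mathcal{L}_{\mathrm{CCC}} = 1 - \rho_c,$$
where $\mu$ and $\sigma^2$ are batch means and variances and $\rho$ is the Pearson correlation coefficient. Unlike an L1 or L2 loss, $\rho_c$ reaches 1 only when the predictions agree with the targets in both location and scale, so minimizing $\mathcal{L}_{\mathrm{CCC}}$ penalizes correlation and calibration errors jointly.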
Considering the above advantages and disadvantages, we make the following improvements and contributions:
• We improve the original label distribution learning (LDL) method and propose a new selective variance label distribution learning (SVLDL) method that adaptively selects the optimal distribution matching the variance.
• The quality of the fitted distributions is improved by additionally fitting the first-order difference distribution, and a brief theoretical proof is given (an illustrative sketch follows this list).
• The age estimation performance is enhanced by using Lin's concordance correlation coefficient (CCC) loss [27].
• The performance is further improved by adding an auxiliary task of gender recognition and using WavLM as the speech feature extractor.
• Experimental results on the publicly available NIST SRE08-10 dataset and a real-world dataset show that the improved SVLDL framework achieves state-of-the-art performance compared to the framework of [3].
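As a loose illustration of the difference-fitting idea above (this is our own reading, not necessarily the exact formulation used in the paper): let $y$ denote the constructed target distribution and $\hat{y}$ the prediction over $K$ age bins; an additional term can penalize mismatched first-order differences,
$$\mathcal{L}_{\mathrm{Diff}}(\hat{y}, y) = \sum_{k=1}^{K-1} \left[ (\hat{y}_{k+1} - \hat{y}_{k}) - (y_{k+1} - y_{k}) \right]^{2},$$
which pushes the slope pattern of $\hat{y}$ toward that of the unimodal target and thereby discourages spurious extra peaks.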
2. METHODOLOGY
2.1. Network Architecture
Figure 1 outlines the pipeline of the proposed method. Since ECAPA-TDNN [28] has an efficient design, with components such as Res2Net [29] and squeeze-and-excitation (SE) blocks [30], it is used as the backbone model. All the information along the time dimension is aggregated through attentive statistics pooling (SP). After the SP, there are two fully connected layers, and finally a softmax layer is connected to obtain the output distribution over the labels, denoted as $\hat{y}$; the output of the middle layer is denoted as $z$, which is also used as input for the auxiliary task of gender recognition.
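To make the data flow concrete, the following is a minimal PyTorch sketch of this backend. It is not the authors' code: the pooled feature size, the 1024-dimensional embedding, the number of age bins, and the absence of extra non-linearities are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SVLDLBackend(nn.Module):
    """Sketch of the backend: a shared embedding z feeds the age head
    (softmax over age bins) and an auxiliary gender head."""

    def __init__(self, pooled_dim=3072, emb_dim=1024, num_age_bins=101, num_genders=2):
        super().__init__()
        # first FC layer after attentive statistics pooling -> shared embedding z
        self.fc1 = nn.Linear(pooled_dim, emb_dim)
        # second FC layer; softmax over its output is the predicted age distribution
        self.fc_age = nn.Linear(emb_dim, num_age_bins)
        # auxiliary gender head consumes the same embedding z (multi-task learning)
        self.fc_gender = nn.Linear(emb_dim, num_genders)

    def forward(self, pooled):
        # pooled: (batch, pooled_dim) statistics-pooled ECAPA-TDNN features
        z = self.fc1(pooled)                               # shared embedding z
        age_dist = torch.softmax(self.fc_age(z), dim=-1)   # predicted age distribution
        gender_logits = self.fc_gender(z)                  # logits for the auxiliary gender task
        return age_dist, gender_logits
```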
2.2. Self-supervised Representation
Motivated by the successful application of self-supervised learning (SSL) in various speech domains, we explore the use of WavLM [18] for the task of speaker age estimation. The WavLM model learns speech representations by solving contrastive tasks in a latent space in a self-supervised manner: it tries to recover the randomly masked parts of the encoded audio features. By learning from large amounts of real multilingual, multi-channel unlabeled data, SSL models can deeply understand contextual information and produce high-quality speech representations in the latent space.
In our framework, as seen in Figure 1, we utilize all latent outputs of the WavLM transformer layers $\Phi = (\phi_1, \dots, \phi_L)$ and assign a trainable weight $W = (w_1, \dots, w_L)$ to each of them. The weighted sum is then used to generate the speech features $x = \sum_{i=1}^{L} \phi_i \cdot w_i$, where $\Phi \in \mathbb{R}^{L \times T \times C_f}$, $x \in \mathbb{R}^{T \times C_f}$, $T$ is the number of time frames, $C_f$ is the feature size, and $L$ is the number of WavLM layers. In this way, the model can make full use of speech information from shallow to deep, from concrete to abstract.
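A minimal PyTorch sketch of this learnable layer combination follows; it implements the plain weighted sum $x = \sum_i \phi_i \cdot w_i$ given above (whether the weights are additionally normalized, e.g. by a softmax, is not stated in this excerpt, so they are left unconstrained; all names are ours).

```python
import torch
import torch.nn as nn


class WeightedLayerSum(nn.Module):
    """Combine the outputs of all L WavLM transformer layers with one
    trainable scalar weight per layer: x = sum_i w_i * phi_i."""

    def __init__(self, num_layers):
        super().__init__()
        # one trainable weight per WavLM layer, initialized uniformly
        self.weights = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, layer_outputs):
        # layer_outputs: (L, batch, T, C_f) stacked hidden states of WavLM
        w = self.weights.view(-1, 1, 1, 1)
        return (w * layer_outputs).sum(dim=0)  # -> (batch, T, C_f)


# usage sketch: hidden_states is a list of L tensors of shape (batch, T, C_f)
# returned by a WavLM model configured to output all hidden states:
# combine = WeightedLayerSum(num_layers=len(hidden_states))
# x = combine(torch.stack(hidden_states, dim=0))
```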
2.3. Label Distribution Learning
Before introducing SVLDL, we need to know how LDL works and understand some parameters: $\hat{\mu}_n$ and $\hat{\sigma}_n$ are the mean and standard deviation of the predicted distribution, and $\mu_n$ and $\sigma_n$ are those of the constructed target label distribution.
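As background, a standard LDL construction consistent with these definitions (stated in our notation and as an assumption about the full setup, which is not reproduced here): the target distribution for the $n$-th utterance with true age $\mu_n$ is a discretized Gaussian over the $K$ age labels,
$$y_{n,k} = \frac{1}{Z_n}\exp\!\left(-\frac{(k-\mu_n)^2}{2\sigma_n^2}\right), \qquad Z_n = \sum_{j=1}^{K}\exp\!\left(-\frac{(j-\mu_n)^2}{2\sigma_n^2}\right),$$
and training minimizes the KL divergence $\mathcal{L}_{\mathrm{KL}}(\hat{y}_n, y_n) = \sum_{k} y_{n,k}\log\frac{y_{n,k}}{\hat{y}_{n,k}}$ between this target and the predicted distribution $\hat{y}_n$. SVLDL's selective-variance idea, as described in the abstract, then adaptively selects among candidate values of $\sigma_n$ the target distribution that best matches the prediction.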