WEAK-SUPERVISED DYSARTHRIA-INVARIANT FEATURES FOR SPOKEN LANGUAGE
UNDERSTANDING USING AN FHVAE AND ADVERSARIAL TRAINING
Jinzi Qi1, Hugo Van hamme1
1KU Leuven, Department of Electrical Engineering ESAT-PSI,
Kasteelpark Arenberg 10, Leuven, Belgium
ABSTRACT
The scarcity of training data and the large speaker variation
in dysarthric speech lead to poor accuracy and poor speaker
generalization of spoken language understanding systems for
dysarthric speech. Working at the level of the speech features,
we focus on improving model generalization with limited
dysarthric data. Factorized Hierarchical Variational Auto-
Encoders (FHVAEs), trained without supervision, have shown
their advantage in disentangling content and speaker
representations. However, earlier work showed that dysarthria
is reflected in both feature vectors. Here, we add adversarial
training to bridge the gap between the control and dysarthric
speech data domains, and we extract dysarthria- and speaker-
invariant features using weak supervision. The extracted
features are evaluated on a spoken language understanding
task and yield a higher accuracy on unseen speakers with
more severe dysarthria compared to features from the basic
FHVAE model or plain filterbank features.
Index Terms—Dysarthric speech, FHVAE, adversarial
training, weak supervision, end-to-end spoken language un-
derstanding.
1. INTRODUCTION
Dysarthric speech data, especially accurately labeled data, are
scarce due to difficulties in recruitment, collection and label-
ing. This data insufficiency problem has been a constant bar-
rier for automatic speech recognition (ASR) model training.
Recent ASR models [1, 2, 3] tend to be more complex and
employ more parameters, which makes training from limited
dysarthric data even more difficult.
Current research mainly addresses the dysarthric speech data
deficiency problem in three ways. One solution is to exploit
abundant canonical (control) speech data during training,
e.g. pretraining on control speech and finetuning on
dysarthric speech [4, 5, 6]. A second approach is to decrease
the model size [7], or to train a small inserted module in-
stead of finetuning the whole model [8, 9], so that the number
of parameters learned on the dysarthric data is limited. Thirdly,
and differently from solutions that work on the training strat-
egy or model structure, [10, 11, 12, 13] focus directly on the
data and apply augmentation to generate more dysarthric speech
for use in training.
The features used in most of these methods preserve speaker
characteristics and dysarthria information, and thus limit
the model's generalization ability. In practical scenarios,
however, we would like the ASR model to generalize well
and work reliably on unseen speakers (users), to reduce any
training effort for dysarthric users. In this work, we focus
on improving the generalization ability of the automatic
recognition model with limited dysarthric data. Instead of
training and testing the model with speech features that pre-
serve variability between dysarthric speakers, we propose to
use dysarthria-invariant and speaker-independent speech fea-
tures for automatic recognition, which can both boost the
model's generalization to unseen speakers and make it possible
to include abundant canonical speech in training.
Factorized Hierarchical Variational Auto-Encoders (FH-
VAE) [14, 15] have shown their ability to disentangle the con-
tent space and the speaker space in input speech. An FHVAE
models the generative process of a sequence of segments in
a hierarchical structure. It encodes a speech utterance into a
segment-related variable (short time scale) and a sequence-
related variable (long time scale) via two linked encoders.
The segment-related latent variable represents the informa-
tion that only appears in a single segment in the sequence,
such as the content of the segment, and is conditioned on the
sequence-related variable. The sequence-related variable re-
flects the features of the whole sequence, like the acoustic
environment or speaker characteristics. The model hence of-
fers a separation between speaker characteristics (sequence)
and content (segment). The content variable (segment-related
variable) appears to fit the speaker-independence desideratum
outlined above. However, our previous work [16] shows that
dysarthria is reflected in both latent variables.
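The two linked encoders described above can be caricatured as follows. This is only an illustrative sketch, not the authors' implementation: random projections stand in for the trained encoder networks, and all names and dimensions (`encode_utterance`, `d_latent`) are hypothetical. The point is the hierarchy: one sequence-level variable per utterance, and per-segment variables whose encoder is conditioned on it.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_utterance(segments, d_latent=2):
    """Toy sketch of the two linked FHVAE encoders.

    segments: array of shape (n_segments, d_in), one feature
    vector per short-time segment of the utterance.
    Returns (z1, z2): per-segment content variables and one
    sequence-level variable.
    """
    n_seg, d_in = segments.shape

    # Sequence encoder: one z2 per utterance, summarizing all
    # segments (speaker / acoustic-environment information).
    W2 = rng.normal(size=(d_in, d_latent))
    z2 = np.tanh(segments.mean(axis=0) @ W2)

    # Segment encoder: one z1 per segment, conditioned on z2 so
    # that sequence-level information can be factored out of the
    # content code (here: z2 is concatenated to each segment).
    W1 = rng.normal(size=(d_in + d_latent, d_latent))
    cond = np.concatenate([segments, np.tile(z2, (n_seg, 1))], axis=1)
    z1 = np.tanh(cond @ W1)
    return z1, z2

# Usage: a 5-segment utterance with 8-dimensional features
z1, z2 = encode_utterance(rng.normal(size=(5, 8)))
```

In the trained model, the decoder reconstructs each segment from its z1 together with the utterance's z2, which is what pushes content into z1 and sequence-level factors into z2.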
To achieve a greater degree of dysarthria-invariance in the
content variable, instead of using forced regularization with
phoneme alignment [16], we introduce adversarial training [18]
to the FHVAE model, inspired by [17]. We feed the FHVAE model
with both control and dysarthric speech, regarding the content
variable encoder as the generator. We add a
discriminator that tries to distinguish control from dysarthric
content variables. With adversarial training between the gen-
Copyright 2023 IEEE. This is a preprint version. The original version is to be published in the 2022 IEEE Spoken Language Technology Workshop (SLT) (SLT 2022), scheduled for 09-12 January 2023 in Doha, Qatar. Personal use of this material is
permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other
works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
arXiv:2210.13144v1 [eess.AS] 24 Oct 2022