WEAK-SUPERVISED DYSARTHRIA-INVARIANT FEATURES FOR SPOKEN LANGUAGE
UNDERSTANDING USING AN FHVAE AND ADVERSARIAL TRAINING
Jinzi Qi1, Hugo Van hamme1
1KU Leuven, Department of Electrical Engineering-ESAT-PSI,
Kasteelpark Arenberg 10, Leuven, Belgium
ABSTRACT
The scarcity of training data and the large speaker variation in dysarthric speech lead to poor accuracy and poor speaker generalization of spoken language understanding systems for dysarthric speech. Through work on the speech features, we focus on improving the model generalization ability with limited dysarthric data. Factorized Hierarchical Variational Auto-Encoders (FHVAE) trained without supervision have shown their advantage in disentangling content and speaker representations. However, earlier work showed that dysarthria is reflected in both feature vectors. Here, we add adversarial training to bridge the gap between the control and dysarthric speech data domains. Using weak supervision, we extract features that are invariant to both dysarthria and speaker identity. The extracted features are evaluated on a Spoken Language Understanding task and yield higher accuracy on unseen speakers with more severe dysarthria than features from the basic FHVAE model or plain filterbanks.
Index Terms: Dysarthric speech, FHVAE, adversarial training, weak supervision, end-to-end spoken language understanding.
1. INTRODUCTION
Dysarthric speech data, especially accurately labeled data, are scarce due to difficulties in recruitment, collection and labeling. This data insufficiency has been a constant barrier to automatic speech recognition (ASR) model training. Recent ASR models [1, 2, 3] tend to be more complex and employ more parameters, which makes training from limited dysarthric data even more difficult.
Current research addresses the dysarthric speech data deficiency in three main ways. One solution is to take abundant canonical (control) speech data into account during training, e.g. pretraining on control speech and finetuning on dysarthric speech [4, 5, 6]. A second approach is to decrease the model size [7], or to train a small inserted module instead of finetuning the whole model [8, 9], so that the number of parameters learned from the dysarthric data stays limited. Thirdly, and differently from the solutions that work on training strategy or model structure, [10, 11, 12, 13] focus directly on the data and apply augmentation to generate more dysarthric speech for use in training.
The features used in most methods preserve speaker characteristics and dysarthria information, and thus limit the model's generalization ability. In practical scenarios, however, we would like the ASR model to generalize well and work reliably on unseen speakers (users), to reduce any training effort for dysarthric users. In this work, we focus on improving the generalization ability of automatic recognition models with limited dysarthric data. Instead of training and testing the model with speech features that preserve variability between dysarthric speakers, we propose to use dysarthria-invariant and speaker-independent speech features for automatic recognition, which both boosts the model's generalization ability on unseen speakers and makes it possible to include abundant canonical speech in training.
Factorized Hierarchical Variational Auto-Encoders (FHVAE) [14, 15] have shown their ability to disentangle the content space and the speaker space of input speech. An FHVAE models the generative process of a sequence of segments in a hierarchical structure. It encodes a speech utterance into a segment-related variable (short time scale) and a sequence-related variable (long time scale) via two linked encoders. The segment-related latent variable represents the information that only appears in a single segment of the sequence, such as the content of that segment, and is conditioned on the sequence-related variable. The sequence-related variable reflects properties of the whole sequence, like the acoustic environment or speaker characteristics. The model hence offers a separation between speaker characteristics (sequence) and content (segment). The content (segment-related) variable appears to fit the speaker-independence desideratum outlined above. However, our previous work [16] shows that dysarthria is reflected in both latent variables.
To achieve a greater degree of dysarthria-invariance in the content variable, instead of using forced regularization with phoneme alignment [16], and inspired by [17], we introduce adversarial training [18] into the FHVAE model. We feed the FHVAE model with both control and dysarthric speech, regarding the content variable encoder as the generator, and add a discriminator that tries to distinguish control from dysarthric content variables. Through adversarial training between the generator and the discriminator, we force the content variable encoder (generator) to extract content information from both control and dysarthric speech that shares the same dysarthria-invariant latent space. To pull the extracted variable closer to the control data space and to improve disentanglement, we also try two additional loss terms, called the reference loss and the disentanglement loss (see section 2). Note that we only need the speaker type label as weak supervision for training the complete proposed model. The content encoder is shared by dysarthric and canonical speech, which opens the possibility of including abundant canonical speech features in the downstream model (ASR) training in future work.
For evaluation, we use the trained content encoder to extract the content variables from other datasets and use these variables to train and test a mature End-to-End (E2E) Spoken Language Understanding (SLU) system [19, 20]. The system consists of a recurrent neural network (RNN) encoder, a capsule network [21] and output layers, and maps input speech features to an intent (semantic) representation without using an explicit textual representation. E2E SLU evaluation differs from speaker-independent ASR evaluation in that we have (a limited number of) task-specific recordings by dysarthric speakers at our disposal for training the SLU system, whereas ASR systems would typically be trained on task-agnostic data. Because of the domain adaptation inherent in this evaluation, it is harder to show generalization properties on E2E SLU tasks.
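As an illustration only, and not the authors' implementation, the following PyTorch sketch shows the shape of such a downstream intent classifier operating on extracted content variables; the GRU size, the number of intents, and the plain linear head standing in for the capsule network of [21] are our assumptions:

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Simplified E2E SLU model: a bidirectional GRU encoder over content
    features, with a linear intent head replacing the capsule network."""

    def __init__(self, feat_dim=32, hidden_dim=128, num_intents=20):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.intent_head = nn.Linear(2 * hidden_dim, num_intents)

    def forward(self, z1_seq):
        # z1_seq: (batch, num_segments, feat_dim) content variables
        # produced by the trained FHVAE content encoder.
        _, h = self.encoder(z1_seq)           # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)   # concatenate both directions
        return self.intent_head(h)            # intent logits
```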
Apart from the SLU evaluation, we also train and test a binary speaker type classifier (control/dysarthric) using both the content variables and the sequence-related variables, and use the classification accuracy to show how the distribution of dysarthria information changes after adversarial training.
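Such a probe can be as simple as a logistic regression on the latent variables; a minimal sketch in which the feature arrays are random stand-ins for the variables extracted by the FHVAE:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for per-segment latent variables (z1 or z2) of shape
# (num_segments, latent_dim); labels: 0 = control, 1 = dysarthric.
# In real use these come from the trained FHVAE encoders.
z_train = rng.normal(size=(2000, 32))
y_train = rng.integers(0, 2, size=2000)
z_test = rng.normal(size=(500, 32))
y_test = rng.integers(0, 2, size=500)

probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
print("speaker-type accuracy:", accuracy_score(y_test, probe.predict(z_test)))
# Near-chance accuracy on z1 after adversarial training would indicate
# that dysarthria information has been removed from the content variable.
```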
We introduce the feature extraction methods, i.e. the FHVAE model with adversarial training, reference loss and disentanglement loss, in section 2. Section 3 describes the data and the experimental settings. Results and analysis are provided in section 4, and section 5 gives conclusions.
2. METHOD
In this section, we introduce the feature extraction methods: the FHVAE model extended with adversarial training, a reference loss and a disentanglement loss.
2.1. FHVAE with adversarial training
Suppose we have $I$ speech data sequences, each sequence $X^i$ containing $N^i$ segments $x^{i,n}$, $n = 1, 2, \ldots, N^i$. The segment $x^{i,n}$ is represented by two latent variables $z_1^{i,n}$ and $z_2^{i,n}$ via an FHVAE:

$$q(z_2^{i,n} \mid x^{i,n}) = \mathcal{N}\big(\text{Enc}_{\mu_{z2}}(x^{i,n}),\, \text{Enc}_{\sigma^2_{z2}}(x^{i,n})\big) \quad (1)$$

$$q(z_1^{i,n} \mid x^{i,n}, z_2^{i,n}) = \mathcal{N}\big(\text{Enc}_{\mu_{z1}}(x^{i,n}, z_2^{i,n}),\, \text{Enc}_{\sigma^2_{z1}}(x^{i,n}, z_2^{i,n})\big) \quad (2)$$

where $\text{Enc}_{\mu_{z2}}(\cdot)$ and $\text{Enc}_{\sigma^2_{z2}}(\cdot)$ are the networks encoding the input $x$ to the mean and variance of $z_2$, and $\text{Enc}_{\mu_{z1}}(\cdot)$ and $\text{Enc}_{\sigma^2_{z1}}(\cdot)$ are the encoders of $z_1$. Following the theory of variational auto-encoders, the latent variables should follow prior distributions:

$$p(z_1^{i,n}) = \mathcal{N}(0, \sigma^2_{z1} I) \quad (3)$$

$$p(z_2^{i,n} \mid \mu_2^i) = \mathcal{N}(\mu_2^i, \sigma^2_{z2} I), \qquad p(\mu_2^i) = \mathcal{N}(0, \sigma^2_{\mu2} I) \quad (4)$$

where $\sigma^2_{z1}$, $\sigma^2_{z2}$ and $\sigma^2_{\mu2}$ are chosen upfront and $z_1^{i,n}$, $z_2^{i,n}$ are i.i.d. samples. The prior enforces the variables $z_2^{i,n}$ within the same sequence $i$ to be close.
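To make the structure of equations (1)-(4) concrete, the following is a minimal PyTorch sketch of the two linked encoders. The feed-forward layers, dimensions and diagonal-Gaussian reparameterization are our own simplifying assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class FHVAEEncoders(nn.Module):
    """Two linked encoders (eqs. 1-2): z2 is inferred from the segment x
    alone; z1 is inferred from x together with the sampled z2."""

    def __init__(self, x_dim=80, z1_dim=32, z2_dim=32, hidden=256):
        super().__init__()
        self.enc_z2 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.z2_mu = nn.Linear(hidden, z2_dim)
        self.z2_logvar = nn.Linear(hidden, z2_dim)
        self.enc_z1 = nn.Sequential(nn.Linear(x_dim + z2_dim, hidden), nn.ReLU())
        self.z1_mu = nn.Linear(hidden, z1_dim)
        self.z1_logvar = nn.Linear(hidden, z1_dim)

    @staticmethod
    def sample(mu, logvar):
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        h2 = self.enc_z2(x)
        mu2, lv2 = self.z2_mu(h2), self.z2_logvar(h2)
        z2 = self.sample(mu2, lv2)                      # eq. (1)
        h1 = self.enc_z1(torch.cat([x, z2], dim=-1))
        mu1, lv1 = self.z1_mu(h1), self.z1_logvar(h1)
        z1 = self.sample(mu1, lv1)                      # eq. (2)
        return z1, z2, (mu1, lv1), (mu2, lv2)
```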
The variable $z_1^{i,n}$, conditioned on $z_2^{i,n}$ (see equation (2)), only contains information specific to the speech segment $x^{i,n}$, and is regarded as the speaker-independent content variable. The variable $z_2^{i,n}$ reflects properties of the whole sequence $X^i$ and can be seen as the speaker variable. The mean of the sequence-level variable $\mu_2^i$ can be inferred as:

$$\tilde{\mu}_2^i = \frac{\sum_{n=1}^{N} \text{Enc}_{\mu_{z2}}(x^{i,n})}{N + \sigma^2_{z2} / \sigma^2_{\mu2}} \quad (5)$$
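In code, equation (5) is an average of the per-segment $z_2$ means with a shrinkage term $\sigma^2_{z2}/\sigma^2_{\mu2}$ added to the denominator; a minimal sketch, assuming the means of one sequence are stacked in a single tensor:

```python
import torch

def infer_mu2(z2_means: torch.Tensor, var_z2: float, var_mu2: float) -> torch.Tensor:
    """Posterior mean of the sequence-level variable (eq. 5).
    z2_means: (N, z2_dim) tensor of Enc_mu_z2(x^{i,n}) for one sequence."""
    n = z2_means.shape[0]
    return z2_means.sum(dim=0) / (n + var_z2 / var_mu2)
```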
Then, for the whole sequence $X^i$, the inference model is written as:

$$q(Z_1^i, Z_2^i, \mu_2^i \mid X^i) = q(\mu_2^i \mid X^i) \prod_{n=1}^{N} q(z_1^{i,n} \mid x^{i,n}, z_2^{i,n}) \, q(z_2^{i,n} \mid x^{i,n}) \quad (6)$$

where $Z_1^i = \{z_1^{i,n}\}_{n=1}^{N}$ and $Z_2^i = \{z_2^{i,n}\}_{n=1}^{N}$. The generative model for the data is then:

$$p(x^{i,n} \mid z_1^{i,n}, z_2^{i,n}) = \mathcal{N}\big(\text{Dec}_{\mu_x}(z_1^{i,n}, z_2^{i,n}),\, \text{Dec}_{\sigma^2_x}(z_1^{i,n}, z_2^{i,n})\big) \quad (7)$$

where $\text{Dec}_{\mu_x}(\cdot)$ and $\text{Dec}_{\sigma^2_x}(\cdot)$ form the decoder networks of the FHVAE.
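A matching sketch of the decoder of equation (7), with the same caveat that the layer sizes are assumed; it maps the concatenated latent variables to the mean and log-variance of the reconstructed segment:

```python
import torch
import torch.nn as nn

class FHVAEDecoder(nn.Module):
    """Decoder (eq. 7): reconstructs the segment x from (z1, z2)."""

    def __init__(self, x_dim=80, z1_dim=32, z2_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z1_dim + z2_dim, hidden), nn.ReLU())
        self.x_mu = nn.Linear(hidden, x_dim)
        self.x_logvar = nn.Linear(hidden, x_dim)

    def forward(self, z1, z2):
        h = self.net(torch.cat([z1, z2], dim=-1))
        return self.x_mu(h), self.x_logvar(h)   # parameters of eq. (7)
```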
The dysarthria information, being a speaker characteristic, should be captured by the speaker variable $z_2^{i,n}$ of the FHVAE. However, our previous work [16] found that the FHVAE does not separate the dysarthria and content information, and speech impairment remains identifiable from $z_1^{i,n}$.

To obtain dysarthria-invariant content features $z_1^{i,n}$, inspired by [17], we introduce adversarial training into the FHVAE model. The data flow of the proposed model is shown in figure 1.
Repurposing the $z_1$ mean encoder $\text{Enc}_{\mu_{z1}}(\cdot)$ as the generator in adversarial training, we take $\mu_{z1}^{i,n}$ as the generator output and feed it into a discriminator $D(\cdot)$, which is trained to yield the probability $p_D^{i,n}$ that a speech segment $x^{i,n}$ stems from a control speaker:

$$p_D^{i,n} = D(\mu_{z1}^{i,n}) = D\big(\text{Enc}_{\mu_{z1}}(x^{i,n})\big) \quad (8)$$

Training of $D(\cdot)$ involves dysarthric and control data and minimizes the cross-entropy:

$$\mathcal{L}_D = CE_{\text{dys}} + CE_{\text{ctrl}} \quad (9)$$
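The following is a minimal sketch of the discriminator update of equations (8)-(9); the network shape, optimizer settings and batch conventions are our assumptions, and the sigmoid of equation (8) is folded into the loss via BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

# D(.): predicts (as a logit) the probability that a segment's z1 mean
# stems from a control speaker (eq. 8).
disc = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def discriminator_step(mu_z1_ctrl, mu_z1_dys):
    """One update of D, minimizing L_D = CE_dys + CE_ctrl (eq. 9).
    mu_z1_*: (batch, z1_dim) generator outputs Enc_mu_z1(x), detached so
    that this step trains only the discriminator."""
    logits_ctrl = disc(mu_z1_ctrl.detach())   # target: control = 1
    logits_dys = disc(mu_z1_dys.detach())     # target: dysarthric = 0
    loss_d = bce(logits_ctrl, torch.ones_like(logits_ctrl)) \
           + bce(logits_dys, torch.zeros_like(logits_dys))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d.item()
```

In the adversarial counterpart (the generator update), the content encoder would be rewarded for making $D(\cdot)$ unable to tell the two domains apart.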