AN EXPERIMENTAL STUDY ON PRIVATE AGGREGATION OF TEACHER ENSEMBLE
LEARNING FOR END-TO-END SPEECH RECOGNITION
Chao-Han Huck Yang1,2, I-Fan Chen2, Andreas Stolcke2,
Sabato Marco Siniscalchi1,3, Chin-Hui Lee1
1Georgia Institute of Technology, USA and 2Amazon Alexa AI, USA
3Department of Electronic Systems, NTNU, Trondheim, Norway
ABSTRACT
Differential privacy (DP) is one data protection avenue to
safeguard user information used for training deep models by
imposing noise-based distortion on private data. Such a noise per-
turbation often results in a severe performance degradation
in automatic speech recognition (ASR) in order to meet a
privacy budget ε. Private aggregation of teacher ensemble
(PATE) utilizes ensemble probabilities to improve ASR accu-
racy when dealing with the noise effects controlled by small
values of ε. We extend PATE learning to work with dynamic
patterns, namely speech utterances, and perform a first exper-
imental demonstration that it prevents acoustic data leakage
in ASR training. We evaluate three end-to-end deep models,
including LAS, hybrid CTC/attention, and RNN transducer,
on the open-source LibriSpeech and TIMIT corpora. PATE
learning-enhanced ASR models outperform the benchmark
DP-SGD mechanisms, especially under strict DP budgets,
giving relative word error rate reductions between 26.2% and
27.5% for an RNN transducer model evaluated with Lib-
riSpeech. We also introduce a DP-preserving ASR solution
for pretraining on public speech corpora.
Index Terms: privacy-preserving learning, automatic
speech recognition, teacher-student learning, ensemble
training.
1. INTRODUCTION
Automatic speech recognition (ASR) [1] is widely used in
spoken language processing applications, such as smart de-
vice control [2], intelligent human-machine dialogue [3], and
spoken language understanding [4]. To build ASR systems, a
large collection of user speech [5] is often required for train-
ing high-performance acoustic models. Protecting informa-
tion privacy and measuring numerical privacy loss, such as
whether data from a specific user is used for model training,
are becoming critical and prominent research topics for on-
device speech applications [6, 7, 8, 9, 10, 11, 12].
Work done at Georgia Tech and Amazon. Parts of this study were com-
pleted while the first author was an intern at Amazon.
Differential privacy (DP) introduces a noise addition
scheme for information protection characterized by mea-
surable privacy budgets. The noise level is defined by a
privacy budget (e.g., controlled by a minimum ε value) in
ε-DP [13, 14] for a given data set. Machine learning frame-
works based on ε-DP have been shown effective against
leakage of training data (e.g., human faces), model inversion
attacks (MIA) [15], and membership inference, in which a
query-free algorithm is used to generate highly confident test
data similar to the training set. Deploying ε-DP in speech ap-
plications based on deep models has recently attracted much
interest [16]. However, directly applying ε-DP perturbation
on the training data could lead to severe performance (e.g.,
prediction accuracy) degradation [17]. Therefore, the noise-
enabled protection process needs to be designed carefully for
incorporation into machine learning for training speech mod-
els. Most published DP-based approaches are still limited to
recognition of isolated spoken commands [7]. Designing a
continuous speech recognition system with ε-DP protection
needs further investigation under a variety of privacy settings.
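To make the effect of the privacy budget concrete, the following is a minimal sketch (not taken from this paper) of the classic Laplace mechanism applied to a single acoustic feature frame; the function name, the assumed L1 sensitivity of 1.0, and the toy feature values are illustrative only.

```python
import numpy as np

def laplace_perturb(features: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Add Laplace noise with scale sensitivity/epsilon, the classic eps-DP mechanism."""
    scale = sensitivity / epsilon              # smaller epsilon -> larger noise -> stronger privacy
    noise = np.random.laplace(loc=0.0, scale=scale, size=features.shape)
    return features + noise

# Toy usage: perturb one 40-dimensional log-Mel frame under a strict budget (epsilon = 1).
frame = np.random.randn(40)                    # stand-in for a real acoustic feature frame
private_frame = laplace_perturb(frame, sensitivity=1.0, epsilon=1.0)
```

The inverse scaling of the noise with ε is what makes small privacy budgets so damaging to recognition accuracy.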
In this work we use ε-DP protection to show that acoustic
features used in ASR training data can be protected against
model inversion attack (MIA) [15]. We also demonstrate that
ε-DP can prevent data leakage from a pretrained ASR model.
Private aggregation of teacher ensembles (PATE) [18]
learning is a recently proposed framework that aims to avoid
accuracy loss in large-scale visual prediction models while
guaranteeing ε-DP protection. PATE is guided by a teacher-
student learning process, where multiple teachers make up
an ensemble for knowledge transfer. The idea underlying
PATE is to use the aggregated noisy outputs of the teacher
models to label nonsensitive public data under
ε-DP protections. PATE and related approaches [19] were
proven effective in mitigating model accuracy deterioration
by employing an ensemble of teacher models with inde-
pendent noise perturbations. Nonetheless, for ASR systems
dealing with continuous speech, label stream prediction (e.g.,
by sequence level modeling) is required. In this study, we
apply the PATE framework to continuous speech recognition
and design different ensemble strategies to ensure ε-DP for
[Figure 1 diagram: (a) teacher ASR models trained on sensitive data (e.g., identity); their noisy (differentially private) ensemble output, via knowledge distillation and ensembling, labels unlabeled non-sensitive public data; (c) the resulting student ASR model is publicly accessible (query-free).]
Fig. 1: Proposed framework for utilizing private aggregation of teacher ensembles (PATE) to train end-to-end ASR with ε-
differential privacy. Teacher ASR models could also be combined with pretraining on public data (our second case study).
end-to-end ASR models, including RNN transducer [20],
hybrid CTC/attention [21], and LAS (Listener, Attender,
Speller) [22] networks, as shown in the blue squares in Fig-
ure 1. Under a strict DP budget (ε=1), the PATE-trained end-
to-end models maintain data privacy at the expense of only a
slight increase in word error rate (WER). They also show
competitive advantages when compared to differentially private
stochastic gradient descent (DP-SGD) [16, 23] benchmarks.
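To illustrate the aggregation idea, the sketch below shows a noisy-vote step for a single output label. This is a simplified illustration rather than the paper's implementation: the function and parameter names (pate_noisy_vote, gamma) are hypothetical, and it omits the sequence-level ensemble strategies developed here for continuous ASR.

```python
import numpy as np

def pate_noisy_vote(teacher_token_ids: np.ndarray, num_classes: int, gamma: float) -> int:
    """Count teacher votes for one output step, add Laplace noise, return the arg-max label."""
    votes = np.bincount(teacher_token_ids, minlength=num_classes).astype(float)
    votes += np.random.laplace(0.0, 1.0 / gamma, size=num_classes)   # gamma trades privacy for label accuracy
    return int(np.argmax(votes))

# Toy usage: 8 teacher models predict a token id for the current step of a public utterance;
# the noisy majority label is what the student ASR model is trained on.
teacher_preds = np.array([3, 3, 3, 7, 3, 3, 7, 3])
student_label = pate_noisy_vote(teacher_preds, num_classes=32, gamma=0.5)
```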
2. RELATED WORK
Recent research efforts to preserve data privacy in an ASR
system can be categorized into two groups: (i) systemic, such
as federated learning [24], data isolation [25], and data en-
cryption [26], and (ii) algorithmic, mainly differentially pri-
vate machine learning [16]. In this section, we first summa-
rize some of the privacy-preserving solutions proposed for
ASR. Next, we describe the fundamentals of differential privacy
devised for machine learning applications and discuss how they
differ from our proposed approach, while highlighting its key
contributions.
2.1. Privacy-Preserving Automatic Speech Recognition
Federated machine learning algorithms [24, 27, 25] have been
investigated in the ASR community to improve information
protection. For instance, the average gradient method [27]
was deployed to update the model in ASR training. Verti-
cal federated learning methods [25] offer additional benefits
from isolated feature extractors under heterogeneous compu-
tation. However, these system-level frameworks often rely on
assumptions that constrain the access available to malicious
parties, and they rarely provide universal, statistical justifica-
tion for privacy guarantees. Cryptographic en-
cryption [26, 28] and computation protocols [29] are estab-
lished techniques for privacy-preserving speaker identifica-
tion. However, these encryption algorithms and protocols
do not consider privacy protection for training samples, which
is a crucial element in developing machine learning models on
a large scale. Lately, differentially-private stochastic gradient
descent (DP-SGD) [16, 23] was introduced to allow quantita-
tive privacy measurement and to protect against identity infer-
ence (e.g., of accent or speaking condition) by introducing additive
distortion during model training. In the remainder of the pa-
per, a mathematical formulation of differential privacy is pro-
vided, as well as a study of its effectiveness for ASR.
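As a point of reference for the DP-SGD baseline discussed above, one DP-SGD update can be sketched as per-example gradient clipping followed by Gaussian noise. The NumPy sketch below is a simplified illustration under assumed array shapes; the function and parameter names are not taken from [16, 23].

```python
import numpy as np

def dp_sgd_step(per_example_grads: np.ndarray, params: np.ndarray,
                clip_norm: float, noise_multiplier: float, lr: float) -> np.ndarray:
    """One DP-SGD update: clip each example's gradient, average, add Gaussian noise, then step."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    batch_size = per_example_grads.shape[0]
    noisy_mean = clipped.mean(axis=0) + np.random.normal(
        0.0, noise_multiplier * clip_norm / batch_size, size=params.shape)
    return params - lr * noisy_mean

# Toy usage: 16 per-example gradients for a 10-parameter model.
grads = np.random.randn(16, 10)
params = dp_sgd_step(grads, params=np.zeros(10), clip_norm=1.0, noise_multiplier=1.1, lr=0.01)
```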
2.2. Differential Privacy Fundamentals
The differential privacy mechanism [13] is a standard method
to deploy algorithms with a target privacy guarantee.
Definition 1. A randomized algorithm M with domain D and
range R is (ε, δ)-differentially private if for any two neighboring
inputs (e.g., speech data) d, d′ ∈ D and for any subset of output
predictions (e.g., labels) S ⊆ R, the following holds:

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ.    (1)
The definition above produces a notion of privacy that can
be expressed as a bound on how much the probability of any
specific outcome may change, by a multiplicative factor e^ε and
an additive amount δ; for example, ε = 1 limits the multiplicative
change to e ≈ 2.72. The DP mechanism with post-
processing [16] (e.g., batch-wise training) is analyzed under a
general Rényi-divergence [14] measurement with order α ∈ (1, ∞).
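For completeness, the Rényi divergence of order α between two distributions P and Q (restating the standard definition, not quoted from this paper) is

D_α(P ∥ Q) = (1 / (α − 1)) · log E_{x∼Q}[ (P(x) / Q(x))^α ],    α ∈ (1, ∞),

which approaches the Kullback-Leibler divergence as α → 1.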