
AN EXPERIMENTAL STUDY ON PRIVATE AGGREGATION OF TEACHER ENSEMBLE
LEARNING FOR END-TO-END SPEECH RECOGNITION
Chao-Han Huck Yang1,2∗, I-Fan Chen2, Andreas Stolcke2
Sabato Marco Siniscalchi1,3, Chin-Hui Lee1
1Georgia Institute of Technology, USA and 2Amazon Alexa AI, USA
3Department of Electronic Systems, NTNU, Trondheim, Norway
ABSTRACT
Differential privacy (DP) is one data protection avenue to
safeguard user information used for training deep models by
imposing noisy distortion on private data. Such noise pertur-
bation often results in severe performance degradation in
automatic speech recognition (ASR) when meeting a given
privacy budget ε. Private aggregation of teacher ensembles
(PATE) utilizes ensemble probabilities to improve ASR accu-
racy when dealing with the noise effects controlled by small
values of ε. We extend PATE learning to work with dynamic
patterns, namely speech utterances, and perform a first exper-
imental demonstration that it prevents acoustic data leakage
in ASR training. We evaluate three end-to-end deep models,
including LAS, hybrid CTC/attention, and RNN transducer,
on the open-source LibriSpeech and TIMIT corpora. PATE
learning-enhanced ASR models outperform the benchmark
DP-SGD mechanisms, especially under strict DP budgets,
giving relative word error rate reductions between 26.2% and
27.5% for an RNN transducer model evaluated with Lib-
riSpeech. We also introduce a DP-preserving ASR solution
for pretraining on public speech corpora.
Index Terms—privacy-preserving learning, automatic
speech recognition, teacher-student learning, ensemble train-
ing.
1. INTRODUCTION
Automatic speech recognition (ASR) [1] is widely used in
spoken language processing applications, such as smart de-
vice control [2], intelligent human-machine dialogue [3], and
spoken language understanding [4]. To build ASR systems, a
large collection of user speech [5] is often required for train-
ing high-performance acoustic models. Protecting informa-
tion privacy and measuring numerical privacy loss, such as
whether data from a specific user is used for model training,
are becoming critical and prominent research topics for on-
device speech applications [6, 7, 8, 9, 10, 11, 12].
∗Work done at Georgia Tech and Amazon. Parts of this study were com-
pleted while the first author was an intern at Amazon.
Differential privacy (DP) introduces a noise addition
scheme for information protection characterized by mea-
surable privacy budgets. The noise level is defined by a
privacy budget (e.g., controlled by a minimum ε value) in
ε-DP [13, 14] for a given data set. Machine learning frame-
works based on ε-DP have been shown effective against
leakage of training data (e.g., human faces) via model inver-
sion attacks (MIA) [15] and membership inference, in which
a query-free algorithm is used to generate highly confident test
data similar to its training set. Deploying ε-DP in speech ap-
plications based on deep models has recently attracted much
interest [16]. However, directly applying ε-DP perturbation
on the training data could lead to severe performance (e.g.,
prediction accuracy) degradation [17]. Therefore, the noise-
enabled protection process needs to be designed carefully for
incorporation into machine learning for training speech mod-
els. Most published DP-based approaches are still limited to
recognition of isolated spoken commands [7]. Designing a
continuous speech recognition system with ε-DP protection
needs further investigation under a variety of privacy settings.
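To make the trade-off above concrete, the classic Laplace mechanism attains ε-DP by adding noise whose scale grows as the sensitivity-to-ε ratio; a smaller (stricter) ε therefore means larger distortion of the released values. The sketch below is a minimal illustration of this mechanism in NumPy, not the perturbation scheme of any particular ASR system described here.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` under eps-DP by adding Laplace noise of
    scale sensitivity/epsilon (the standard Laplace mechanism)."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# Stricter budget (small epsilon) -> larger noise -> more distortion.
strict = [laplace_mechanism(1.0, 1.0, 0.1, rng) for _ in range(1000)]
loose = [laplace_mechanism(1.0, 1.0, 10.0, rng) for _ in range(1000)]
```

Comparing the spread of `strict` versus `loose` releases shows directly why aggressive privacy budgets degrade downstream accuracy unless the protection is designed into the training procedure.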
In this work we use ε-DP protection to show that acoustic
features in the ASR training data can be protected against
model inversion attacks (MIA) [15]. We also demonstrate that
ε-DP can prevent data leakage from a pretrained ASR model.
Private aggregation of teacher ensembles (PATE) [18]
learning is a recently proposed framework that aims to avoid
accuracy loss in large-scale visual prediction models while
guaranteeing ε-DP protection. PATE is guided by a teacher-
student learning process, where multiple teachers make up
an ensemble for knowledge transfer. The idea underlying
PATE is to benefit from aggregated noisy outputs of teacher
models to compartmentalize nonsensitive public data with
ε-DP protections. PATE and related approaches [19] have
proven effective in mitigating model accuracy deterioration
by employing an ensemble of teacher models with inde-
pendent noise perturbations. Nonetheless, for ASR systems
dealing with continuous speech, label stream prediction (e.g.,
by sequence level modeling) is required. In this study, we
apply the PATE framework to continuous speech recognition
and design different ensemble strategies to ensure ε-DP for
arXiv:2210.05614v2 [cs.SD] 13 Oct 2022