
AN EXPERIMENTAL STUDY ON PRIVATE AGGREGATION OF TEACHER ENSEMBLE
LEARNING FOR END-TO-END SPEECH RECOGNITION
Chao-Han Huck Yang1,2∗, I-Fan Chen2, Andreas Stolcke2
Sabato Marco Siniscalchi1,3, Chin-Hui Lee1
1Georgia Institute of Technology, USA and 2Amazon Alexa AI, USA
3Department of Electronic Systems, NTNU, Trondheim, Norway
ABSTRACT
Differential privacy (DP) is one data protection avenue to
safeguard user information used for training deep models by
imposing noisy distortion on private data. Such noise pertur-
bation often results in severe performance degradation in
automatic speech recognition (ASR) when meeting a given
privacy budget ε. Private aggregation of teacher ensembles
(PATE) utilizes ensemble probabilities to improve ASR accu-
racy when dealing with the noise effects controlled by small
values of ε. We extend PATE learning to work with dynamic
patterns, namely speech utterances, and perform a first exper-
imental demonstration that it prevents acoustic data leakage
in ASR training. We evaluate three end-to-end deep models,
including LAS, hybrid CTC/attention, and RNN transducer,
on the open-source LibriSpeech and TIMIT corpora. PATE
learning-enhanced ASR models outperform the benchmark
DP-SGD mechanisms, especially under strict DP budgets,
giving relative word error rate reductions between 26.2% and
27.5% for an RNN transducer model evaluated with Lib-
riSpeech. We also introduce a DP-preserving ASR solution
for pretraining on public speech corpora.
Index Terms—privacy-preserving learning, automatic
speech recognition, teacher-student learning, ensemble train-
ing.
1. INTRODUCTION
Automatic speech recognition (ASR) [1] is widely used in
spoken language processing applications, such as smart de-
vice control [2], intelligent human-machine dialogue [3], and
spoken language understanding [4]. To build ASR systems, a
large collection of user speech [5] is often required for train-
ing high-performance acoustic models. Protecting informa-
tion privacy and measuring numerical privacy loss, such as
whether data from a specific user is used for model training,
are becoming critical and prominent research topics for on-
device speech applications [6, 7, 8, 9, 10, 11, 12].
∗Work done at Georgia Tech and Amazon. Parts of this study were com-
pleted while the first author was an intern at Amazon.
Differential privacy (DP) introduces a noise addition
scheme for information protection characterized by mea-
surable privacy budgets. The noise level is defined by a
privacy budget (e.g., controlled by a minimum ε value) in
ε-DP [13, 14] for a given data set. Machine learning frame-
works based on ε-DP have been shown effective against
leakage of training data (e.g., human faces) via model inver-
sion attacks (MIA) [15] and membership inference, in which
a query-free algorithm is used to generate highly confident test
data similar to its training set. Deploying ε-DP in speech ap-
plications based on deep models has recently attracted much
interest [16]. However, directly applying ε-DP perturbation
on the training data could lead to severe performance (e.g.,
prediction accuracy) degradation [17]. Therefore, the noise-
enabled protection process needs to be designed carefully for
incorporation into machine learning for training speech mod-
els. Most published DP-based approaches are still limited to
recognition of isolated spoken commands [7]. Designing a
continuous speech recognition system with ε-DP protection
needs further investigation under a variety of privacy settings.
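To make the trade-off above concrete, the classic Laplace mechanism attains ε-DP by adding noise whose scale grows as the sensitivity-to-ε ratio; a smaller (stricter) ε therefore means larger distortion of the released values. The sketch below is a minimal illustration of this mechanism in NumPy, not the perturbation scheme of any particular ASR system described here.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` under eps-DP by adding Laplace noise of
    scale sensitivity/epsilon (the standard Laplace mechanism)."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# Stricter budget (small epsilon) -> larger noise -> more distortion.
strict = [laplace_mechanism(1.0, 1.0, 0.1, rng) for _ in range(1000)]
loose = [laplace_mechanism(1.0, 1.0, 10.0, rng) for _ in range(1000)]
```

Comparing the spread of `strict` versus `loose` releases shows directly why aggressive privacy budgets degrade downstream accuracy unless the protection is designed into the training procedure.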
In this work we use ε-DP protection to show that acoustic
features in the ASR training data can be protected against
model inversion attacks (MIA) [15]. We also demonstrate that
ε-DP can prevent data leakage from a pretrained ASR model.
Private aggregation of teacher ensembles (PATE) [18]
learning is a recently proposed framework that aims to avoid
accuracy loss in large-scale visual prediction models while
guaranteeing ε-DP protection. PATE is guided by a teacher-
student learning process, where multiple teachers make up
an ensemble for knowledge transfer. The idea underlying
PATE is to benefit from aggregated noisy outputs of teacher
models to compartmentalize nonsensitive public data with
ε-DP protections. PATE and related approaches [19] have
proven effective in mitigating model accuracy deterioration
by employing an ensemble of teacher models with inde-
pendent noise perturbations. Nonetheless, for ASR systems
dealing with continuous speech, label stream prediction (e.g.,
by sequence level modeling) is required. In this study, we
apply the PATE framework to continuous speech recognition
and design different ensemble strategies to ensure ε-DP for
arXiv:2210.05614v2 [cs.SD] 13 Oct 2022