
IMPROVING SEMI-SUPERVISED END-TO-END AUTOMATIC SPEECH RECOGNITION
USING CYCLEGAN AND INTER-DOMAIN LOSSES
Chia-Yu Li and Ngoc Thang Vu
Institute for Natural Language Processing (IMS), University of Stuttgart, Germany
ABSTRACT
We propose a novel method that combines CycleGAN and inter-domain losses for semi-supervised end-to-end automatic speech recognition. The inter-domain loss targets the extraction of an intermediate shared representation of speech and text inputs using a shared network. CycleGAN uses the cycle-consistent loss and the identity mapping loss to preserve relevant characteristics of the input feature after conversion from one domain to another. As such, both approaches are suitable for training end-to-end models on unpaired speech-text inputs. In this paper, we exploit the advantages of both the inter-domain loss and CycleGAN to achieve a better shared representation of unpaired speech and text inputs and thus improve the speech-to-text mapping. Our experimental results on WSJ eval92 and Voxforge (non-English) show an 8∼8.5% character error rate reduction over the baseline, and results on LibriSpeech test-clean also show noticeable improvement.
Index Terms— speech recognition, end-to-end, semi-supervised training, CycleGAN
1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) directly learns the mapping from an acoustic feature sequence to a label (character or subword) sequence using an encoder-decoder architecture [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
One popular architecture is the hybrid CTC/attention model, which effectively combines the advantages of a connectionist temporal classification (CTC)-based model and an attention-based model in training and decoding [12, 13]. The CTC model uses Markov assumptions to efficiently solve sequential problems by dynamic programming [1, 2], and the attention model uses an attention mechanism [14] to align acoustic frames with labels. The hybrid CTC/attention model improves robustness, achieves fast convergence, mitigates alignment issues, and performs comparably to conventional ASR based on hidden Markov models (HMMs) and deep neural networks (DNNs) [13]. However, E2E models require a sufficiently large amount of paired speech-text data to achieve comparable performance [15, 16]. Paired data is expensive to obtain, especially for low-resource languages. In contrast, huge amounts of free unpaired speech and text data are available on the Internet, which can be exploited together with limited paired data to improve E2E ASR in a semi-supervised manner.
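The hybrid CTC/attention objective mentioned above is an interpolation of the two branch losses. A minimal sketch in plain Python (the weight `lam` and the loss values are illustrative placeholders, not numbers from this paper):

```python
def hybrid_ctc_attention_loss(loss_ctc: float, loss_att: float, lam: float = 0.3) -> float:
    """Interpolate CTC and attention losses, as in hybrid CTC/attention training.

    lam weights the CTC branch; (1 - lam) weights the attention branch.
    """
    assert 0.0 <= lam <= 1.0
    return lam * loss_ctc + (1.0 - lam) * loss_att

# With lam = 0.3, a CTC loss of 2.0 and an attention loss of 1.0
# combine to 0.3 * 2.0 + 0.7 * 1.0 = 1.3.
print(hybrid_ctc_attention_loss(2.0, 1.0, lam=0.3))
```

In practice both branch losses are differentiable tensors and the interpolated value is backpropagated through the shared encoder.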
Cycle-consistent adversarial networks (CycleGANs) have demonstrated better model generalization using the cycle-consistent loss and the identity mapping loss on unpaired data [17]. Most studies in the field of semi-supervised E2E ASR exploit the cycle-consistent loss to leverage unpaired data by combining speech-to-text with text-to-speech or text-to-text models [18, 19, 20, 21, 22, 23]. However, the effect of the identity mapping loss on semi-supervised E2E ASR performance has not been investigated, even though Zhu et al. observe that the identity mapping loss helps preserve the color of the input painting [17]. Among the previously mentioned studies, an interesting work proposes an inter-domain loss that targets the extraction of an intermediate shared representation of speech and text using a shared network. That work combines speech-to-text and text-to-text mappings through the shared network in a semi-supervised end-to-end manner and thus improves speech-to-text performance [21]. However, the inter-domain loss, which is the dissimilarity between the embeddings of unpaired speech and text, might introduce errors into the shared network because it tries to minimize the distance between unpaired encoded speech and text. For instance, if the speech is "actually the word I used there was presented" and the text is "what's wrong with that", the shared network learns to generate similar embeddings for both.
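The inter-domain loss discussed above can be sketched as a distance between pooled speech and text embeddings produced by the shared network. A minimal illustration in plain Python (mean pooling and the squared Euclidean distance are our assumptions for exposition, not necessarily the exact formulation in [21]):

```python
def mean_pool(frames):
    """Average a sequence of embedding vectors into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def inter_domain_loss(speech_emb, text_emb):
    """Squared Euclidean distance between pooled speech and text embeddings.

    For unpaired inputs this pulls embeddings of unrelated utterances
    together, which is the failure mode described in the text.
    """
    s, t = mean_pool(speech_emb), mean_pool(text_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t))

# Two sequences with identical pooled embeddings give zero loss:
print(inter_domain_loss([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]))  # 0.0
```

Minimizing this distance over unpaired speech-text batches forces the shared network to emit similar embeddings even when the contents differ, motivating the cycle-consistent variant proposed below.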
In this paper, we extend the previous work in the following aspects: 1) To the best of our knowledge, we are the first to investigate the effect of the identity mapping loss on semi-supervised E2E ASR performance; 2) We propose a cycle-consistent inter-domain loss, defined as the dissimilarity between the encoded speech and its hypothesis, to help the shared network learn a better representation; 3) We combine the identity mapping loss and the cycle-consistent inter-domain loss in a single framework for semi-supervised E2E ASR and achieve a noticeable performance improvement; 4) We provide an analysis of the ASR output and a visualization of the inter-domain embeddings of speech and text, which explains the performance gain of our proposed method.
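The identity mapping loss investigated in contribution 1) penalizes the text-to-text path when it fails to reproduce its input. A toy, non-differentiable proxy in plain Python (our illustration only; a real system would use a differentiable loss such as cross-entropy over label posteriors rather than a hard mismatch count):

```python
def identity_mapping_loss(text: str, reconstruction: str) -> float:
    """Character-level mismatch rate between input text and the output of
    the text-to-text path; zero exactly when the mapping is the identity.
    """
    length = max(len(text), len(reconstruction))
    mismatches = sum(a != b for a, b in zip(text, reconstruction))
    mismatches += abs(len(text) - len(reconstruction))  # count length drift too
    return mismatches / length

print(identity_mapping_loss("hello", "hallo"))  # one of five characters differs -> 0.2
```

Intuitively, this plays the same role as CycleGAN's identity mapping loss on paintings: it discourages the model from altering inputs that are already in the target domain.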
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.11642v1 [cs.CL] 20 Oct 2022