IMPROVING SEMI-SUPERVISED END-TO-END AUTOMATIC SPEECH RECOGNITION
USING CYCLEGAN AND INTER-DOMAIN LOSSES
Chia-Yu Li and Ngoc Thang Vu
Institute for Natural Language Processing (IMS), University of Stuttgart, Germany
ABSTRACT
We propose a novel method that combines CycleGAN and
inter-domain losses for semi-supervised end-to-end automatic
speech recognition. Inter-domain loss targets the extrac-
tion of an intermediate shared representation of speech and
text inputs using a shared network. CycleGAN uses cycle-
consistent loss and the identity mapping loss to preserve
relevant characteristics of the input feature after converting
from one domain to another. As such, both approaches are
suitable to train end-to-end models on unpaired speech-text
inputs. In this paper, we exploit the advantages from both
inter-domain loss and CycleGAN to achieve better shared
representation of unpaired speech and text inputs and thus
improve the speech-to-text mapping. Our experimental results on WSJ eval92 and Voxforge (non-English) show an 8–8.5% character error rate reduction over the baseline, and the results on LibriSpeech test-clean also show a noticeable improvement.
Index Terms— speech recognition, end-to-end, semi-supervised training, CycleGAN
1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) directly learns the mapping from an acoustic feature sequence to a label (character or subword) sequence using an encoder-decoder architecture [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
One of the popular architectures is hybrid CTC/attention,
which effectively utilizes the advantages of connectionist
temporal classification (CTC) based model and attention
based model in training and decoding [12, 13]. The CTC
model uses Markov assumptions to efficiently solve sequen-
tial problems by dynamic programming [1, 2], and the at-
tention model uses an attention mechanism [14] to perform
alignment between acoustic frames and labels. The hybrid
CTC/attention model improves the robustness and achieves
fast convergence, mitigates the alignment issues, and achieves
comparable performance as compared to the conventional
ASR based on a hidden Markov model (HMM)/deep neural
networks (DNNs) [13]. However, an E2E model requires a sufficiently large amount of paired speech-text data to achieve comparable performance [15, 16]. Paired data is expensive to obtain, especially for low-resource languages, whereas huge amounts of free unpaired speech and text data are available on the Internet, which we could use together with limited paired data to improve E2E ASR in a semi-supervised manner.
Cycle-consistent adversarial networks (CycleGAN) have demonstrated better model generalization by using a cycle-consistency loss and an identity mapping loss on unpaired data [17]. Most studies in the field of semi-supervised E2E ASR exploit the cycle-consistency loss to leverage unpaired data by combining speech-to-text with text-to-speech or text-to-text models [18, 19, 20, 21, 22, 23]. However, the effect of the identity mapping loss on semi-supervised E2E ASR performance has not been investigated, even though Zhu et al. observed that the identity mapping loss helps preserve the color of the input painting [17]. Among the previously mentioned studies, one interesting work proposes an inter-domain loss, which targets the extraction of an intermediate shared representation of speech and text using a shared network.
This work combines speech-to-text and text-to-text mappings
through the shared network in a semi-supervised end-to-end
manner and thus improves the speech-to-text performance
[21]. However, the inter-domain loss, which is the dissimilarity between the embeddings of unpaired speech and text, might introduce errors into the shared network because it minimizes the distance between unpaired encoded speech and text. For instance, if the speech is "actually the word I used there was presented" and the text is "what's wrong with that", the shared network still learns to generate similar embeddings for both of them.
In this paper, we contribute to the previous work in the following aspects: 1) To the best of our knowledge, we are the first to investigate the effect of the identity mapping loss on semi-supervised E2E ASR performance; 2) We propose a cycle-consistent inter-domain loss, which is the dissimilarity between encoded speech and the hypothesis, in order to help the shared network learn a better representation; 3) We combine the identity mapping loss and the cycle-consistent inter-domain loss in a single framework for semi-supervised E2E ASR and achieve noticeable performance improvements; 4) We provide an analysis of the ASR output and a visualization of the inter-domain embeddings of speech and text, which explains the performance gain of our proposed method.
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.11642v1 [cs.CL] 20 Oct 2022
2. METHOD
2.1. Semi-supervised E2E ASR
Fig. 1: The architecture of hybrid CTC/attention model [12,
13] (left) and the semi-supervised E2E model [21] (right).
Figure 1 (left) shows the architecture of the hybrid CTC/attention model within the multi-task learning framework [12, 13], which is an encoder-decoder model. The encoder is trained by both CTC and attention objectives simultaneously. The encoder $e(\cdot)$ transforms the acoustic feature sequence $x = [x_1, x_2, \dots, x_m]$ to the embedding $b = [b_1, b_2, \dots, b_u]$; then the attention-based decoder $d(\cdot)$ predicts the current label $y_t$ given the embedding $b$ and the previous label $y_{t-1}$.
The processing pipeline is defined as follows [13]:

$$b = e(x)$$
$$[\Pr(y_t \mid y_{t-1}, b),\, h_t] = d(y_{t-1}, h_{t-1}, b)$$

where $y_0 = \langle \mathrm{SOS} \rangle$ is the start-of-sequence label and the initial state $h_0$ is zero. We write the above equations in sequence form [13]:

$$d(b) = \Pr(y \mid b) = \prod_{t=1}^{|y|} \Pr(y_t \mid y_{t-1}, b) \quad (1)$$
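As a worked illustration of the factorization in Eq. (1), the sequence probability is the product of per-step decoder probabilities, usually accumulated in log space; the hand-picked per-step probabilities below are toy stand-ins for the decoder's softmax outputs, not real model scores:

```python
import numpy as np

def sequence_log_prob(step_probs):
    """log of Eq. (1): Pr(y|b) = prod_t Pr(y_t | y_{t-1}, b),
    summed in log space for numerical stability."""
    return float(np.sum(np.log(step_probs)))

# Three decoding steps; each value plays the role of
# Pr(y_t | y_{t-1}, b) looked up from the decoder output.
probs = [0.9, 0.8, 0.5]
print(np.exp(sequence_log_prob(probs)))  # ≈ 0.36 (= 0.9 * 0.8 * 0.5)
```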
where $y = [y_1, y_2, \dots, y_{|y|}]$ is a predicted text and $|y|$ is the length of the text. The conventional loss for paired speech-text data $(x', y') \in Z$ is the negative log-likelihood of the ground-truth text $y'$ given the encoded speech $e(x')$ [13]:

$$L_{\mathrm{pair}} = -\sum_{(x', y') \in Z} \log \Pr(y' \mid e(x')) \quad (2)$$
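The per-utterance term of Eq. (2) can be sketched as a token-level negative log-likelihood over the decoder's softmax outputs; this is a minimal NumPy illustration with a toy 3-label vocabulary, not the actual training code:

```python
import numpy as np

def pair_loss(decoder_probs, targets):
    """One utterance's contribution to Eq. (2):
    -log Pr(y'|e(x')) = -sum_t log Pr(y'_t | y'_{t-1}, e(x')).
    decoder_probs: (T, V) per-step softmax outputs.
    targets: length-T ground-truth label ids y'."""
    steps = decoder_probs[np.arange(len(targets)), targets]
    return -float(np.sum(np.log(steps)))

# Toy decoder output: 2 steps over a vocabulary of 3 labels.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(pair_loss(p, y))  # ≈ 0.58, i.e. -(log 0.7 + log 0.8)
```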
For semi-supervised E2E, Karita et al. propose a framework which encodes speech and text into a common latent space $B$ and re-trains the E2E model on unpaired data with an inter-domain loss [24, 25, 26] and a text-to-text autoencoder [27] loss, see Figure 1 (right). We refer to the common intermediate embedding $b \in B$ as an "inter-domain embedding". The input acoustic feature $x$ is fed to the encoder $e(\cdot) = \hat{e}(f(\cdot))$ and transformed into the inter-domain embedding $b$. On the other hand, the input text is fed to the text embedding $g(\cdot)$ and processed by the shared encoder $\hat{e}(\cdot)$, which generates the inter-domain embedding $b' = \hat{e}(g(y))$. The inter-domain loss is the dissimilarity between the embeddings $b$ and $b'$, see the blue line in Figure 1 (right). Based on the authors' code and [21], the authors have explored an adversarial loss [28], Gaussian KL-divergence [29] and Maximum Mean Discrepancy (MMD) [30] for the inter-domain loss; we choose the variant with the best result as the baseline for this work.
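As a sketch of one of the inter-domain loss candidates, the squared MMD between a batch of speech embeddings $b$ and a batch of text embeddings $b'$ can be estimated with an RBF kernel; this biased NumPy estimator is illustrative only (the kernel and bandwidth are assumptions, not the authors' implementation):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Pairwise RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(B_speech, B_text, sigma=1.0):
    """Biased estimate of squared MMD between the two embedding
    batches; zero when both batches come from the same distribution."""
    kxx = rbf_kernel(B_speech, B_speech, sigma).mean()
    kyy = rbf_kernel(B_text, B_text, sigma).mean()
    kxy = rbf_kernel(B_speech, B_text, sigma).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

Minimizing this quantity pulls the speech and text embedding distributions together in the shared latent space $B$.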
The text-to-text autoencoder loss measures the negative log-likelihood that the encoder-decoder network can reconstruct text from unpaired text [21, 27] (see the orange loop in Figure 1 (right)); it is defined as follows:

$$L_{\mathrm{text}} = -\sum_{y} \log \Pr(y \mid \hat{e}(g(y))) \quad (3)$$
Since the inter-domain loss for speech-to-text is difficult to optimize due to the large difference between the speech and text domains, the objective combines $L_{\mathrm{pair}}$ and $L_{\mathrm{unpair}}$ with a tunable parameter $\alpha$ as follows [21]:

$$L = \alpha L_{\mathrm{pair}} + (1 - \alpha) L_{\mathrm{unpair}} \quad (4)$$

Note that $L_{\mathrm{pair}}$ is calculated on the small paired dataset and $L_{\mathrm{unpair}}$ on the larger unpaired dataset. $L_{\mathrm{unpair}}$ is composed of the inter-domain loss and the text-to-text autoencoder loss with a tunable speech-to-text ratio $\beta$ as follows [21]:

$$L_{\mathrm{unpair}} = \beta L_{\mathrm{dom}} + (1 - \beta) L_{\mathrm{text}} \quad (5)$$
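Eqs. (4) and (5) amount to a simple convex combination of the three loss terms; a minimal sketch (the scalar loss values here are placeholders for quantities computed by the model):

```python
def semi_supervised_loss(l_pair, l_dom, l_text, alpha=0.5, beta=0.5):
    """Eqs. (4)-(5): L = alpha*L_pair + (1-alpha)*L_unpair,
    where L_unpair = beta*L_dom + (1-beta)*L_text."""
    l_unpair = beta * l_dom + (1.0 - beta) * l_text
    return alpha * l_pair + (1.0 - alpha) * l_unpair

# alpha = 1 recovers purely supervised training on the paired set;
# beta trades off the inter-domain and autoencoder terms.
print(semi_supervised_loss(1.0, 2.0, 4.0, alpha=0.5, beta=0.5))  # 2.0
```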
2.2. Semi-supervised E2E ASR using CycleGAN losses
Fig. 2: Illustration of viewing some components of the semi-supervised E2E ASR model as GeneratorA2B and GeneratorB2A. Note that $g(\cdot)$ is drawn next to the Decoder for better understanding.
CycleGAN exploits a cycle-consistency loss and an identity mapping loss to learn two mappings, $G: A \to B$ and $F: B \to A$, on unpaired data [17]. In the context of semi-supervised E2E, the shared encoder $\hat{e}(\cdot)$ can be viewed as the generator $G: A \to B$, and the composition of the decoder and the text embedding, $d(g(\cdot))$, can be seen as the generator $F: B \to A$.
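Following Zhu et al.'s formulation [17], the two CycleGAN losses can be sketched as follows; the linear toy generators and the L1 distance are illustrative stand-ins for the actual E2E components, not the paper's implementation:

```python
import numpy as np

def cycle_consistency_loss(G, F, A):
    # ||F(G(a)) - a||_1 averaged over samples: map a from domain A
    # to domain B and back, then compare with the original input.
    return float(np.abs(F(G(A)) - A).mean())

def identity_mapping_loss(G, B):
    # ||G(b) - b||_1: feeding the A->B generator a sample that is
    # already in domain B should leave it (nearly) unchanged, which
    # is what preserves input characteristics in [17].
    return float(np.abs(G(B) - B).mean())

# Toy generators on real vectors: G doubles, F halves, so the
# cycle A -> B -> A is exact and the cycle loss is zero.
G = lambda a: 2.0 * a
F = lambda b: 0.5 * b
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(cycle_consistency_loss(G, F, A))  # 0.0
```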