IMPROVING SEMI-SUPERVISED END-TO-END AUTOMATIC SPEECH RECOGNITION
USING CYCLEGAN AND INTER-DOMAIN LOSSES
Chia-Yu Li and Ngoc Thang Vu
Institute for Natural Language Processing (IMS), University of Stuttgart, Germany
ABSTRACT
We propose a novel method that combines CycleGAN and
inter-domain losses for semi-supervised end-to-end automatic
speech recognition. Inter-domain loss targets the extrac-
tion of an intermediate shared representation of speech and
text inputs using a shared network. CycleGAN uses cycle-
consistent loss and the identity mapping loss to preserve
relevant characteristics of the input feature after converting
from one domain to another. As such, both approaches are
suitable to train end-to-end models on unpaired speech-text
inputs. In this paper, we exploit the advantages from both
inter-domain loss and CycleGAN to achieve better shared
representation of unpaired speech and text inputs and thus
improve the speech-to-text mapping. Our experimental results on WSJ eval92 and Voxforge (non-English) show an 8–8.5% character error rate reduction over the baseline, and the results on LibriSpeech test-clean also show a noticeable improvement.
Index Terms— speech recognition, end-to-end, semi-supervised training, CycleGAN
1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) directly learns the mapping from an acoustic feature sequence to a label (character or subword) sequence using an encoder-decoder architecture [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
One of the popular architectures is hybrid CTC/attention,
which effectively utilizes the advantages of connectionist
temporal classification (CTC) based model and attention
based model in training and decoding [12, 13]. The CTC
model uses Markov assumptions to efficiently solve sequen-
tial problems by dynamic programming [1, 2], and the at-
tention model uses an attention mechanism [14] to perform
alignment between acoustic frames and labels. The hybrid
CTC/attention model improves the robustness and achieves
fast convergence, mitigates the alignment issues, and achieves
comparable performance as compared to the conventional
ASR based on a hidden Markov model (HMM)/deep neural
networks (DNNs) [13]. However, an E2E model requires a sufficiently large amount of paired speech-text data to achieve comparable performance [15, 16]. Paired data is expensive to obtain, especially for low-resource languages, whereas huge amounts of free unpaired speech and text data are available on the Internet, which we could use together with limited paired data to improve E2E ASR in a semi-supervised manner.
Cycle-consistent adversarial networks (CycleGAN) have demonstrated better model generalization by using a cycle-consistency loss and an identity mapping loss on unpaired data [17]. Most studies in the field of semi-supervised E2E ASR exploit the cycle-consistency loss to leverage unpaired data by combining speech-to-text with text-to-speech or text-to-text models [18, 19, 20, 21, 22, 23]. However, the effect of the identity mapping loss on semi-supervised E2E ASR performance has not been investigated, even though Zhu et al. observed that the identity mapping loss helps preserve the color of the input painting [17]. Among the previously mentioned studies, one interesting work proposes an inter-domain loss, which targets the extraction of an intermediate shared representation of speech and text using a shared network.
This work combines speech-to-text and text-to-text mappings
through the shared network in a semi-supervised end-to-end
manner and thus improves the speech-to-text performance
[21]. However, the inter-domain loss, which is the dissimilarity between the embeddings of unpaired speech and text, might introduce errors into the shared network because it minimizes the distance between unpaired encoded speech and text. For instance, if the speech is "actually the word I used there was presented" and the text is "what's wrong with that", the shared network still learns to generate similar embeddings for both of them.
In this paper, we contribute to the previous work in the following aspects: 1) To the best of our knowledge, we are the first to investigate the effect of the identity mapping loss on semi-supervised E2E ASR performance; 2) We propose a cycle-consistent inter-domain loss, which is the dissimilarity between encoded speech and the hypothesis, in order to help the shared network learn a better representation; 3) We combine the identity mapping loss and the cycle-consistent inter-domain loss in a single framework for semi-supervised E2E ASR and achieve noticeable performance improvements; 4) We provide an analysis of the ASR output and a visualization of the inter-domain embeddings of speech and text, which explains the performance gain of our proposed method.
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.11642v1 [cs.CL] 20 Oct 2022
2. METHOD
2.1. Semi-supervised E2E ASR
Fig. 1: The architecture of hybrid CTC/attention model [12,
13] (left) and the semi-supervised E2E model [21] (right).
Figure 1 (left) shows the architecture of the hybrid CTC/attention model within the multi-task learning framework [12, 13], which is an encoder-decoder model. The encoder is trained by both CTC and attention objectives simultaneously. The encoder $e(\cdot)$ transforms the acoustic feature sequence $x = [x_1, x_2, \dots, x_m]$ to the embedding $b = [b_1, b_2, \dots, b_u]$; then the attention-based decoder $d(\cdot)$ predicts the current label $y_t$ given the embedding $b$ and the previous label $y_{t-1}$.
The processing pipeline is defined as follows [13]:

$$b = e(x)$$
$$[\Pr(y_t \mid y_{t-1}, b),\, h_t] = d(y_{t-1}, h_{t-1}, b)$$

where $y_0 = \langle \mathrm{SOS} \rangle$ is the start-of-sequence label and the initial state $h_0$ is zero. We write the above equations in sequence form [13]:

$$d(b) = \Pr(y \mid b) = \prod_{t=1}^{|y|} \Pr(y_t \mid y_{t-1}, b) \quad (1)$$
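As a worked illustration of the factorization in Eq. (1), the sequence probability is the product of per-step decoder probabilities, usually accumulated in log space; the hand-picked per-step probabilities below are toy stand-ins for the decoder's softmax outputs, not real model scores:

```python
import numpy as np

def sequence_log_prob(step_probs):
    """log of Eq. (1): Pr(y|b) = prod_t Pr(y_t | y_{t-1}, b),
    summed in log space for numerical stability."""
    return float(np.sum(np.log(step_probs)))

# Three decoding steps; each value plays the role of
# Pr(y_t | y_{t-1}, b) looked up from the decoder output.
probs = [0.9, 0.8, 0.5]
print(np.exp(sequence_log_prob(probs)))  # ≈ 0.36 (= 0.9 * 0.8 * 0.5)
```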
where $y = [y_1, y_2, \dots, y_{|y|}]$ is a predicted text and $|y|$ is the length of the text. The conventional loss for paired speech-text data $(x', y') \in Z$ is the negative log-likelihood of the ground-truth text $y'$ given the encoded speech $e(x')$ [13]:

$$L_{\mathrm{pair}} = -\sum_{(x', y') \in Z} \log \Pr(y' \mid e(x')) \quad (2)$$
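The per-utterance term of Eq. (2) can be sketched as a token-level negative log-likelihood over the decoder's softmax outputs; this is a minimal NumPy illustration with a toy 3-label vocabulary, not the actual training code:

```python
import numpy as np

def pair_loss(decoder_probs, targets):
    """One utterance's contribution to Eq. (2):
    -log Pr(y'|e(x')) = -sum_t log Pr(y'_t | y'_{t-1}, e(x')).
    decoder_probs: (T, V) per-step softmax outputs.
    targets: length-T ground-truth label ids y'."""
    steps = decoder_probs[np.arange(len(targets)), targets]
    return -float(np.sum(np.log(steps)))

# Toy decoder output: 2 steps over a vocabulary of 3 labels.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(pair_loss(p, y))  # ≈ 0.58, i.e. -(log 0.7 + log 0.8)
```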
For semi-supervised E2E, Karita et al. propose a framework which encodes speech and text into a common latent space $B$ and re-trains the E2E model on unpaired data with an inter-domain loss [24, 25, 26] and a text-to-text autoencoder [27] loss, see Figure 1 (right). We refer to the common intermediate embedding $b \in B$ as an "inter-domain embedding". The input acoustic feature $x$ is fed to the encoder $e(\cdot) = \hat{e}(f(\cdot))$ and transformed into the inter-domain embedding $b$. On the other hand, the input text is fed to the text embedding $g(\cdot)$ and processed by the shared encoder $\hat{e}(\cdot)$, which generates the inter-domain embedding $b' = \hat{e}(g(y))$. The inter-domain loss is the dissimilarity between the embeddings $b$ and $b'$, see the blue line in Figure 1 (right). Based on the authors' code and [21], the authors have explored an adversarial loss [28], Gaussian KL-divergence [29] and Maximum Mean Discrepancy (MMD) [30] for the inter-domain loss; we choose the variant with the best result as the baseline for this work.
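As a sketch of one of the inter-domain loss candidates, the squared MMD between a batch of speech embeddings $b$ and a batch of text embeddings $b'$ can be estimated with an RBF kernel; this biased NumPy estimator is illustrative only (the kernel and bandwidth are assumptions, not the authors' implementation):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Pairwise RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(B_speech, B_text, sigma=1.0):
    """Biased estimate of squared MMD between the two embedding
    batches; zero when both batches come from the same distribution."""
    kxx = rbf_kernel(B_speech, B_speech, sigma).mean()
    kyy = rbf_kernel(B_text, B_text, sigma).mean()
    kxy = rbf_kernel(B_speech, B_text, sigma).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

Minimizing this quantity pulls the speech and text embedding distributions together in the shared latent space $B$.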
The text-to-text autoencoder loss measures the negative log-likelihood that the encoder-decoder network can reconstruct text from unpaired text [21, 27] (see the orange loop in Figure 1 (right)); it is defined as follows:

$$L_{\mathrm{text}} = -\sum_{y} \log \Pr(y \mid \hat{e}(g(y))) \quad (3)$$
Since the inter-domain loss for speech-to-text is difficult to optimize due to the large difference between the speech and text domains, the objective combines $L_{\mathrm{pair}}$ and $L_{\mathrm{unpair}}$ with a tunable parameter $\alpha$ as follows [21]:

$$L = \alpha L_{\mathrm{pair}} + (1 - \alpha) L_{\mathrm{unpair}} \quad (4)$$

Note that $L_{\mathrm{pair}}$ is calculated on the small paired dataset and $L_{\mathrm{unpair}}$ on the larger unpaired dataset. $L_{\mathrm{unpair}}$ is composed of the inter-domain loss and the text-to-text autoencoder loss with a tunable speech-to-text ratio $\beta$ as follows [21]:

$$L_{\mathrm{unpair}} = \beta L_{\mathrm{dom}} + (1 - \beta) L_{\mathrm{text}} \quad (5)$$
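Eqs. (4) and (5) amount to a simple convex combination of the three loss terms; a minimal sketch (the scalar loss values here are placeholders for quantities computed by the model):

```python
def semi_supervised_loss(l_pair, l_dom, l_text, alpha=0.5, beta=0.5):
    """Eqs. (4)-(5): L = alpha*L_pair + (1-alpha)*L_unpair,
    where L_unpair = beta*L_dom + (1-beta)*L_text."""
    l_unpair = beta * l_dom + (1.0 - beta) * l_text
    return alpha * l_pair + (1.0 - alpha) * l_unpair

# alpha = 1 recovers purely supervised training on the paired set;
# beta trades off the inter-domain and autoencoder terms.
print(semi_supervised_loss(1.0, 2.0, 4.0, alpha=0.5, beta=0.5))  # 2.0
```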
2.2. Semi-supervised E2E ASR using CycleGAN losses
Fig. 2: Illustration of viewing some components of the semi-supervised E2E ASR model as GeneratorA2B and GeneratorB2A. Note that $g(\cdot)$ is drawn next to the Decoder for better understanding.
CycleGAN exploits a cycle-consistency loss and an identity mapping loss to learn two mappings, $G: A \to B$ and $F: B \to A$, on unpaired data [17]. In the context of semi-supervised E2E, the shared encoder $\hat{e}(\cdot)$ can be viewed as the generator $G: A \to B$, and the composition of the decoder and the text embedding, $d(g(\cdot))$, can be seen as the generator $F: B \to A$.
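Following Zhu et al.'s formulation [17], the two CycleGAN losses can be sketched as follows; the linear toy generators and the L1 distance are illustrative stand-ins for the actual E2E components, not the paper's implementation:

```python
import numpy as np

def cycle_consistency_loss(G, F, A):
    # ||F(G(a)) - a||_1 averaged over samples: map a from domain A
    # to domain B and back, then compare with the original input.
    return float(np.abs(F(G(A)) - A).mean())

def identity_mapping_loss(G, B):
    # ||G(b) - b||_1: feeding the A->B generator a sample that is
    # already in domain B should leave it (nearly) unchanged, which
    # is what preserves input characteristics in [17].
    return float(np.abs(G(B) - B).mean())

# Toy generators on real vectors: G doubles, F halves, so the
# cycle A -> B -> A is exact and the cycle loss is zero.
G = lambda a: 2.0 * a
F = lambda b: 0.5 * b
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(cycle_consistency_loss(G, F, A))  # 0.0
```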