JOINT PRE-TRAINING WITH SPEECH AND BILINGUAL TEXT FOR DIRECT SPEECH TO
SPEECH TRANSLATION
Kun Wei1†, Long Zhou2, Ziqiang Zhang2, Liping Chen2, Shujie Liu2, Lei He2, Jinyu Li2, Furu Wei2
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi'an, China
2Microsoft Corporation
†Work done during internship at Microsoft Research Asia.
ABSTRACT
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages over cascaded S2ST. However, direct S2ST suffers from data scarcity, because corpora pairing source-language speech with target-language speech are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from the source to the target language. We verify the performance of the proposed Speech2S on the Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S achieves an improvement of about 5 BLEU over encoder-only pre-training models, and performs competitively with, or even better than, existing state-of-the-art models1.
1Code and pre-trained models are available at https://github.com/microsoft/SpeechT5/tree/main/Speech2S.
Index Terms— Speech-to-speech translation, joint pre-training, cross-lingual modeling.
1. INTRODUCTION
Direct speech-to-speech translation (S2ST) has gained increasing attention from the research and industry communities in recent years [1–3]. Traditionally, cascaded speech-to-speech translation consists of automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components. Direct S2ST aims to integrate these three tasks into a single end-to-end model that translates speech in one language directly into speech in another. Compared to cascaded S2ST, direct S2ST has the following advantages: (1) it alleviates the error propagation problem of pipeline systems; (2) it can retain the emotion, pitch, and prosody of the speaker to the greatest extent; (3) it offers faster inference and a smaller storage footprint.
However, data scarcity is the biggest problem for direct speech-to-speech translation [4]. At present, parallel S2ST data remain very scarce despite considerable collection efforts [5–7]. To alleviate this problem, one line of work leverages pseudo data to improve direct S2ST [3, 8]: ASR data are typically converted into speech-to-text translation data with an MT system, and the target audio is then generated from the target text with a TTS system. Unfortunately, these methods cannot guarantee the accuracy of the generated pseudo S2ST data. Another line of work boosts the performance of direct S2ST through pre-training [3, 9]. For example, the work in [9] explores pre-training the encoder with the mSLAM objective [10], and pre-training the decoder of Translatotron 2 [11] with an MT task to generate phonemes. The authors in [3] propose to combine a wav2vec 2.0 [12] encoder and an mBART [13] decoder into a speech-to-unit translation (S2UT) model, which can be further improved with data augmentation techniques.
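As a concrete illustration of the pseudo-data recipe in [3, 8], the Python sketch below chains the three cascade stages. The translate, synthesize, and speech_to_units helpers are hypothetical stand-ins for an MT system, a TTS system, and a HuBERT-style unit extractor; this is a sketch of the general recipe, not the actual implementation used in those papers.

def build_pseudo_s2st(asr_corpus, translate, synthesize, speech_to_units):
    """Turn (source speech, transcript) ASR pairs into pseudo S2ST pairs.

    All three helpers are hypothetical: translate is an MT system,
    synthesize is a TTS system, and speech_to_units is a HuBERT-style
    discrete unit extractor.
    """
    pseudo_pairs = []
    for src_speech, src_text in asr_corpus:
        tgt_text = translate(src_text)            # MT: source text -> target text
        tgt_speech = synthesize(tgt_text)         # TTS: target text -> target audio
        tgt_units = speech_to_units(tgt_speech)   # audio -> discrete unit sequence
        # Any MT or TTS error is baked into the pseudo target here, which
        # is why the accuracy of such data cannot be guaranteed.
        pseudo_pairs.append((src_speech, tgt_units))
    return pseudo_pairs

The resulting (source speech, target units) pairs can then serve as training data for a speech-to-unit translation model.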
Although the self-supervised pre-training method in [3] can initialize the direct S2ST model with a pre-trained wav2vec 2.0 encoder and an mBART decoder, trained on discrete units extracted from unlabeled speech with a HuBERT [14] model, it still lacks an effective connection between the encoder and decoder, and it ignores cross-lingual modeling in pre-training. In the real world, speech data, ASR data, and MT data are far more abundant than direct speech-to-speech corpora, and MT data can be used to learn the transformation from source text to target text. How to build a cross-lingual bridge between the speech encoder and the unit decoder of a direct S2ST model with bilingual text at the pre-training stage has not been well explored.
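For readers unfamiliar with discrete units, the sketch below shows the usual HuBERT-style recipe: hidden features from a pre-trained speech model are clustered with k-means, and the cluster indices serve as units. The scikit-learn quantizer, the choice of feature layer, and the 500-cluster count are illustrative assumptions, not the exact configuration of [3] or of this paper.

import numpy as np
from sklearn.cluster import KMeans

def fit_unit_quantizer(feature_batches, n_units=500):
    # feature_batches: list of (frames, dim) arrays of hidden states taken
    # from an intermediate layer of a pre-trained HuBERT-like model.
    feats = np.concatenate(feature_batches, axis=0)
    return KMeans(n_clusters=n_units, n_init=10).fit(feats)

def speech_to_units(features, quantizer):
    # Map one utterance's (frames, dim) features to a discrete unit sequence.
    units = quantizer.predict(features)
    # Collapse consecutive duplicate units, a common post-processing step.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]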
In this paper, we propose a Speech2S model, which aims to model cross-lingual information and alleviate the data scarcity problem by jointly pre-training on unpaired speech and bilingual MT text for the direct speech-to-speech translation task. More specifically, Speech2S consists of a speech encoder, a unit encoder, and a unit decoder. We propose two pre-training tasks to train these three modules, with the unit encoder serving as the bridge between source speech and target units. Like HuBERT [14], the first pre-training objective is to predict the clustered units based on the outputs of both the speech encoder and the unit encoder, using unlabeled speech data. To take advantage of a bilingual machine translation corpus, we first use two text-to-unit models to convert the source/target text into source/target units, with which the cross-lingual unit encoder and decoder can be well pre-trained through a cross-entropy loss.
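To make the two objectives concrete, here is a minimal PyTorch sketch of the corresponding losses. Module depths, dimensions, the masking scheme, and the use of pre-extracted speech features are all simplifying assumptions; the sketch mirrors only the structure described above (speech encoder into unit encoder for masked unit prediction, and unit encoder into unit decoder for cross-lingual unit translation), not the actual Speech2S implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Speech2SSketch(nn.Module):
    # Hypothetical sizes: 500 k-means units plus one padding id, width 768.
    def __init__(self, n_units=501, d=768, pad_id=0):
        super().__init__()
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), 6)
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), 6)
        self.unit_embed = nn.Embedding(n_units, d, padding_idx=pad_id)
        self.unit_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, 8, batch_first=True), 6)
        self.unit_head = nn.Linear(d, n_units)  # predicts clustered units

    def masked_unit_loss(self, speech_feats, units, mask):
        # Task 1 (unlabeled speech): HuBERT-style masked prediction of
        # k-means units from the stacked speech and unit encoders.
        # speech_feats: (B, T, d) frame features (an assumption; the real
        # model consumes raw audio); units: (B, T); mask: (B, T) bool.
        h = self.unit_encoder(self.speech_encoder(speech_feats))
        return F.cross_entropy(self.unit_head(h)[mask], units[mask])

    def unit_translation_loss(self, src_units, tgt_units):
        # Task 2 (bilingual MT text): source/target text is first converted
        # to units by two text-to-unit models; the unit encoder/decoder are
        # then trained as a sequence-to-sequence translator.
        memory = self.unit_encoder(self.unit_embed(src_units))
        dec_in = self.unit_embed(tgt_units[:, :-1])
        t = dec_in.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.unit_decoder(dec_in, memory, tgt_mask=causal)
        logits = self.unit_head(out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt_units[:, 1:].reshape(-1))

Because both tasks share the unit encoder and the unit prediction head, the unit space acts as the cross-lingual bridge between unlabeled speech and bilingual text.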
We evaluate the proposed model on the Europarl-ST [15] and VoxPopuli [5] S2ST datasets. Our contributions can be summarized as follows. (1) We propose a jointly pre-trained Speech2S model that takes advantage of bilingual text data to improve cross-lingual speech conversion. (2) The proposed model achieves a significant improvement of about 5 BLEU over a pre-trained model without MT data. (3) Furthermore, we conduct a detailed analysis of the effect of parallel data size and of data augmentation across domains, along with a subjective evaluation.
2. RELATED WORK
Conventional speech-to-speech translation is usually composed of cascaded ASR, MT, and TTS modules [16, 17]. On this basis, to avoid the error propagation caused by cascaded models, researchers ex-