JOINT PRE-TRAINING WITH SPEECH AND BILINGUAL TEXT FOR DIRECT SPEECH TO
SPEECH TRANSLATION
Kun Wei1†, Long Zhou2, Ziqiang Zhang2, Liping Chen2, Shujie Liu2, Lei He2, Jinyu Li2, Furu Wei2
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi'an, China
2Microsoft Corporation
†Work done during internship at Microsoft Research Asia.
ABSTRACT
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages over cascaded S2ST. However, direct S2ST suffers from data scarcity, because corpora pairing source-language speech with target-language speech are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from the source to the target language. We verify the performance of the proposed Speech2S on the Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S achieves an improvement of about 5 BLEU over encoder-only pre-training models, and performs competitively with, or even better than, existing state-of-the-art models1.
1Code and pre-trained models are available at https://github.com/microsoft/SpeechT5/tree/main/Speech2S.
Index Terms— Speech-to-speech translation, joint pre-training, cross-lingual modeling.
1. INTRODUCTION
Direct speech-to-speech translation (S2ST) has gained increasing attention from the research and industry communities in recent years [1–3]. Traditionally, cascaded speech-to-speech translation consists of automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components. Direct S2ST aims to integrate these three tasks into a single end-to-end model that translates speech in one language directly into speech in another. Compared to cascaded S2ST, direct S2ST has the following advantages: (1) it alleviates the error propagation problem of pipeline systems; (2) it can retain the emotion, pitch, and prosody of the speaker to the greatest extent; (3) it offers faster inference and a smaller storage footprint.
However, data scarcity is the biggest problem for direct speech-to-speech translation [4]. At present, parallel S2ST data remain very scarce despite considerable collection efforts [5–7]. To alleviate this problem, one line of work leverages pseudo data to improve direct S2ST [3, 8]: ASR data are typically converted into speech-to-text translation data with an MT system, and the target audio is then generated from the target text with a TTS system. Unfortunately, these methods cannot guarantee the accuracy of the generated pseudo S2ST data. Another line of work boosts the performance of direct S2ST through pre-training [3, 9]. For example, the work in [9] explores pre-training the encoder with the mSLAM objective [10], and pre-training the decoder of Translatotron 2 [11] with an MT task to generate phonemes. The authors in [3] propose to combine a wav2vec 2.0 [12] encoder and an mBART [13] decoder into a speech-to-unit translation (S2UT) model, which can be further improved with data augmentation techniques.
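As a concrete illustration of the pseudo-data recipe in [3, 8], the Python sketch below chains the three cascade stages. The translate, synthesize, and speech_to_units helpers are hypothetical stand-ins for an MT system, a TTS system, and a HuBERT-style unit extractor; this is a sketch of the general recipe, not the actual implementation used in those papers.

def build_pseudo_s2st(asr_corpus, translate, synthesize, speech_to_units):
    """Turn (source speech, transcript) ASR pairs into pseudo S2ST pairs.

    All three helpers are hypothetical: translate is an MT system,
    synthesize is a TTS system, and speech_to_units is a HuBERT-style
    discrete unit extractor.
    """
    pseudo_pairs = []
    for src_speech, src_text in asr_corpus:
        tgt_text = translate(src_text)            # MT: source text -> target text
        tgt_speech = synthesize(tgt_text)         # TTS: target text -> target audio
        tgt_units = speech_to_units(tgt_speech)   # audio -> discrete unit sequence
        # Any MT or TTS error is baked into the pseudo target here, which
        # is why the accuracy of such data cannot be guaranteed.
        pseudo_pairs.append((src_speech, tgt_units))
    return pseudo_pairs

The resulting (source speech, target units) pairs can then serve as training data for a speech-to-unit translation model.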
Although the self-supervised pre-training method in [3] can initialize the direct S2ST model with a pre-trained wav2vec 2.0 encoder and an mBART decoder, trained on discrete units extracted from unlabeled speech with a HuBERT [14] model, it still lacks an effective connection between the encoder and decoder, and it ignores cross-lingual modeling in pre-training. In the real world, speech data, ASR data, and MT data are far more abundant than direct speech-to-speech corpora, and MT data can be used to learn the transformation from source text to target text. How to build a cross-lingual bridge between the speech encoder and the unit decoder of a direct S2ST model with bilingual text at the pre-training stage has not been well explored.
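For readers unfamiliar with discrete units, the sketch below shows the usual HuBERT-style recipe: hidden features from a pre-trained speech model are clustered with k-means, and the cluster indices serve as units. The scikit-learn quantizer, the choice of feature layer, and the 500-cluster count are illustrative assumptions, not the exact configuration of [3] or of this paper.

import numpy as np
from sklearn.cluster import KMeans

def fit_unit_quantizer(feature_batches, n_units=500):
    # feature_batches: list of (frames, dim) arrays of hidden states taken
    # from an intermediate layer of a pre-trained HuBERT-like model.
    feats = np.concatenate(feature_batches, axis=0)
    return KMeans(n_clusters=n_units, n_init=10).fit(feats)

def speech_to_units(features, quantizer):
    # Map one utterance's (frames, dim) features to a discrete unit sequence.
    units = quantizer.predict(features)
    # Collapse consecutive duplicate units, a common post-processing step.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]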
In this paper, we propose a Speech2S model, which aims to model cross-lingual information and alleviate the data scarcity problem by jointly pre-training on unpaired speech and bilingual MT text for the direct speech-to-speech translation task. More specifically, Speech2S consists of a speech encoder, a unit encoder, and a unit decoder. We propose two pre-training tasks to train these three modules, with the unit encoder serving as the bridge between source speech and target units. Like HuBERT [14], the first pre-training objective is to predict the clustered units based on the outputs of both the speech encoder and the unit encoder, using unlabeled speech data. To take advantage of a bilingual machine translation corpus, we first use two text-to-unit models to convert the source/target text into source/target units, with which the cross-lingual unit encoder and decoder can be well pre-trained through a cross-entropy loss.
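To make the two objectives concrete, here is a minimal PyTorch sketch of the corresponding losses. Module depths, dimensions, the masking scheme, and the use of pre-extracted speech features are all simplifying assumptions; the sketch mirrors only the structure described above (speech encoder into unit encoder for masked unit prediction, and unit encoder into unit decoder for cross-lingual unit translation), not the actual Speech2S implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Speech2SSketch(nn.Module):
    # Hypothetical sizes: 500 k-means units plus one padding id, width 768.
    def __init__(self, n_units=501, d=768, pad_id=0):
        super().__init__()
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), 6)
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), 6)
        self.unit_embed = nn.Embedding(n_units, d, padding_idx=pad_id)
        self.unit_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, 8, batch_first=True), 6)
        self.unit_head = nn.Linear(d, n_units)  # predicts clustered units

    def masked_unit_loss(self, speech_feats, units, mask):
        # Task 1 (unlabeled speech): HuBERT-style masked prediction of
        # k-means units from the stacked speech and unit encoders.
        # speech_feats: (B, T, d) frame features (an assumption; the real
        # model consumes raw audio); units: (B, T); mask: (B, T) bool.
        h = self.unit_encoder(self.speech_encoder(speech_feats))
        return F.cross_entropy(self.unit_head(h)[mask], units[mask])

    def unit_translation_loss(self, src_units, tgt_units):
        # Task 2 (bilingual MT text): source/target text is first converted
        # to units by two text-to-unit models; the unit encoder/decoder are
        # then trained as a sequence-to-sequence translator.
        memory = self.unit_encoder(self.unit_embed(src_units))
        dec_in = self.unit_embed(tgt_units[:, :-1])
        t = dec_in.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.unit_decoder(dec_in, memory, tgt_mask=causal)
        logits = self.unit_head(out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt_units[:, 1:].reshape(-1))

Because both tasks share the unit encoder and the unit prediction head, the unit space acts as the cross-lingual bridge between unlabeled speech and bilingual text.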
We evaluate the proposed model on the Europarl-ST [15] and VoxPopuli [5] S2ST datasets. Our contributions can be summarized as follows. (1) We propose a jointly pre-trained Speech2S model that takes advantage of bilingual text data to improve cross-lingual speech conversion. (2) The proposed model achieves a significant improvement of about 5 BLEU over a pre-trained model without MT data. (3) Furthermore, we conduct a detailed analysis of the effect of parallel data size and of data augmentation across domains, along with a subjective evaluation.
2. RELATED WORK
Conventional speech-to-speech translation is usually composed of cascaded ASR, MT, and TTS modules [16, 17]. On this basis, to avoid the error propagation caused by cascaded models, researchers ex-