IMPROVING SPEECH-TO-SPEECH TRANSLATION THROUGH UNLABELED TEXT
Xuan-Phi Nguyen∗†, Sravya Popuri⋆, Changhan Wang⋆, Yun Tang⋆, Ilia Kulikov⋆ and Hongyu Gong⋆
⋆Meta AI, USA
†Nanyang Technological University, Singapore
nguyenxu002@e.ntu.edu.sg
{spopuri,changhan,yuntang,kulikov,hygong}@meta.com
ABSTRACT
Direct speech-to-speech translation (S2ST) is among the most challenging problems in the translation paradigm due to the significant scarcity of S2ST data. While efforts have been made to increase the data size from unlabeled speech by cascading pretrained speech recognition (ASR), machine translation (MT) and text-to-speech (TTS) models, unlabeled text has remained relatively under-utilized for improving S2ST. We propose an effective way to utilize the massive existing unlabeled text from different languages: we create a large amount of synthetic S2ST data from it and apply various acoustic effects to the generated speech, improving S2ST performance. Empirically, our method outperforms the state of the art in Spanish-English translation by up to 2 BLEU. The proposed method also yields significant gains in extremely low-resource settings for both Spanish-English and Russian-English translation.
Index Terms: Speech-to-speech translation, augmentation, unlabeled text
1. INTRODUCTION
Translating speech in one language into speech in another can be done by trivially cascading automatic speech recognition (ASR) [1] and machine translation (MT) [2, 3, 4] models, or a combined speech-to-text translation (S2T) model [5], with a final text-to-speech (TTS) system [6, 7, 8, 9]. However, such a process suffers from significant inference latency and is prone to error propagation through each stage. Alternatively, there is growing interest in developing direct speech-to-speech translation (S2ST) systems [10, 11, 12]. Not only do these systems offer faster inference, but they also allow translation between unwritten languages and dialects [13]. A recent speech-to-speech model [12] employs a self-supervised pretrained speech encoder [14] and a discrete-unit mBART model to train a speech-to-unit translation (S2UT) model, in which the target speech is converted into discrete units [15]. These units are learned with self-supervision, grouping speech frames by their linguistic and prosodic information via k-means clustering [16].
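For concreteness, the following is a minimal sketch of how such discrete units can be derived by k-means clustering of self-supervised features. The feature source, dimensionality and cluster count are illustrative assumptions, not the exact configuration used in [16].

```python
import itertools

import numpy as np
from sklearn.cluster import KMeans

# Fit a codebook offline on frame-level features pooled from unlabeled speech.
# Random numbers stand in for (num_frames, feat_dim) hidden states from a
# pretrained encoder such as HuBERT (assumed available separately).
pooled_feats = np.random.randn(10_000, 768).astype(np.float32)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(pooled_feats)

def speech_to_units(feats: np.ndarray, kmeans: KMeans) -> list:
    """Map each speech frame to its nearest centroid (a discrete unit),
    then collapse consecutive repeats, as is common in unit-based S2ST."""
    frame_units = kmeans.predict(feats)
    return [int(u) for u, _ in itertools.groupby(frame_units)]

units = speech_to_units(pooled_feats[:50], kmeans)  # short sequence of unit ids
```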
∗Work done during an internship at Meta AI.
Despite their promising capabilities, direct speech-to-speech translation (S2ST) models suffer from significant data scarcity due to the challenge of collecting human-annotated parallel speech. To improve S2ST performance, apart from self-supervised pretraining on unlabeled speech [10, 14, 17], [12] also generated extra synthetic S2ST data from speech recognition (ASR) data by converting its transcripts into speech in the target language using a cascaded MT-TTS system [18]. Nonetheless, such work makes use only of the available audio data, leaving unexploited the existing unlabeled text from numerous languages, sources and domains [19, 17, 20]. Text data is known to be far more massive and diverse than currently available speech data. However, such textual data may be difficult to handle in the speech paradigm, as it lacks crucial information about speakers, speed, pitch and emotion.
In this work, we present an effective strategy for generating synthetic speech-to-speech training data from abundant unlabeled text, such that the resulting speech is not only diverse in semantic content but also varies randomly in acoustic features such as speaker tone. Our approach consists of two processes: (i) “Text-aug” data generation, which creates synthetic S2ST data from unlabeled text; and (ii) “Effects-aug”, an on-the-fly speech augmentation process that transforms toneless Text-aug speech into a varying-tone, noisy version that mimics the distribution of real speech data.
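To illustrate what Effects-aug-style augmentation might look like, the sketch below applies random pitch and speed perturbations followed by additive noise, using torchaudio's sox effects. The choice of effects and all parameter ranges are illustrative assumptions, not the paper's exact recipe.

```python
import random

import torch
import torchaudio

def effects_aug(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Randomly perturb pitch and speed, then add noise, so that flat
    synthetic speech better mimics the variability of real recordings.
    `wav` is a (channels, time) tensor; all ranges are illustrative."""
    effects = [
        ["pitch", str(random.randint(-300, 300))],     # pitch shift in cents
        ["speed", f"{random.uniform(0.9, 1.1):.3f}"],  # tempo/speed change
        ["rate", str(sr)],                             # resample back to sr
    ]
    wav, sr = torchaudio.sox_effects.apply_effects_tensor(wav, sr, effects)
    snr_db = random.uniform(10.0, 30.0)                # random noise level
    noise = torch.randn_like(wav)
    scale = wav.norm(p=2) / (noise.norm(p=2) * 10 ** (snr_db / 20.0))
    return wav + scale * noise
```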
Our approach does not introduce any extra model or data supervision beyond those used in the recent S2UT baseline [12], which relies on supervised MT and TTS models. Instead, we utilize an unsupervised MT model [3] to generate data, as sketched below.
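The sketch shows how one Text-aug training pair could be assembled from a single unlabeled sentence. Here `src_tts`, `unsup_mt` and `tgt_tts` are hypothetical callables standing in for the TTS models and the unsupervised MT model; neither the names nor the interfaces come from the paper.

```python
from typing import Callable, Tuple

import torch

def make_text_aug_pair(
    src_text: str,
    src_tts: Callable[[str], torch.Tensor],   # hypothetical source-side TTS
    unsup_mt: Callable[[str], str],           # hypothetical unsupervised MT
    tgt_tts: Callable[[str], torch.Tensor],   # hypothetical target-side TTS
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Turn one unlabeled source-language sentence into a synthetic
    S2ST training pair of (source speech, target speech)."""
    src_wav = src_tts(src_text)      # synthesize the source-side audio
    tgt_text = unsup_mt(src_text)    # translate the text without supervision
    tgt_wav = tgt_tts(tgt_text)      # synthesize the target-side audio
    return src_wav, tgt_wav          # Effects-aug is then applied on the fly
```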
In the experiments, our method achieves up to 35.2 BLEU on the CoVoST-2 Es-En task and 35.1 BLEU on the Europarl-ST En-Es task, surpassing the state-of-the-art approach [12] by up to 2 BLEU. Further analysis shows that Effects-aug is a crucial step for the extra data to improve performance. We also demonstrate that our method achieves significant performance gains of up to 28 BLEU in low-resource speech-to-speech setups with only 10 hours (hr) of S2ST data.