SpeechUT: Bridging Speech and Text with Hidden-Unit for
Encoder-Decoder Based Speech-Text Pre-training
Ziqiang Zhang1∗, Long Zhou2†, Junyi Ao3∗, Shujie Liu2, Lirong Dai1, Jinyu Li2, Furu Wei2
1University of Science and Technology of China
2Microsoft
3The Chinese University of Hong Kong, Shenzhen
∗Work done during internship at Microsoft Research Asia.   †Corresponding author.
Abstract
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden units as an interface to align speech and text, we decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data, respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT achieves substantial improvements over strong baselines and reaches state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT.
1 Introduction
Self-supervised pre-training with large-scale unlabeled data has achieved remarkable progress on various downstream tasks (Devlin et al., 2019; Radford et al., 2019; Dong et al., 2019; Baevski et al., 2020; Hsu et al., 2021; Chen et al., 2021). Specifically, pre-trained models such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) have extensively promoted the development of natural language processing (NLP). Researchers have also developed many pre-trained speech models using large amounts of unlabeled audio data, e.g., wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021). Although text and speech are two different modalities, they have a natural relationship because both can be viewed as expressions of language. Hence, joint pre-training of speech and text has received increasing attention from the research community in recent years (Ao et al., 2022a; Bapna et al., 2021; Tang et al., 2022).

Figure 1: A high-level illustration of SpeechUT. After being pre-trained with the speech-to-unit and unit-to-text tasks (blue arrows), the model with a shared unit encoder enables speech-to-text tasks for fine-tuning (red arrow).
One line of speech-text joint pre-training builds a shared encoder to learn speech and text representations jointly, such as SLAM (Bapna et al., 2021), which requires randomly initializing the decoder parameters when fine-tuning an encoder-decoder model. Another line of studies, e.g., SpeechT5 (Ao et al., 2022a) and STPT (Tang et al., 2022), directly pre-trains an encoder-decoder model on speech and text corpora to boost the performance of automatic speech recognition (ASR) and speech translation (ST), leveraging unsupervised vector quantization (van den Oord et al., 2017) and supervised speech-text data, respectively, to encourage the alignment of speech and text. For these cross-modal speech-to-text models, a key problem is how to naturally connect the speech encoder and the text decoder.
Our preliminary observation shows that an intermediate hidden-unit representation (Hsu et al., 2021) can serve as a bridge between the speech and text modalities, since it has a strong mapping relationship with both of them (see Appendix A). This inspires us to leverage hidden units as the semantic interface between the speech encoder and the text decoder in the encoder-decoder framework, and to decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be pre-trained with unpaired speech and text data, respectively, as shown in Figure 1.
In this paper, we propose a unified speech-unit-text pre-training method (SpeechUT), using the hidden-unit representation as a bridge between the speech encoder and the text decoder. SpeechUT leverages three unsupervised pre-training tasks: a speech-to-unit (S2U) task that models the mapping between speech and units, as in HuBERT; a masked unit modeling (MUM) task that learns better unit representations; and a unit-to-text (U2T) task that recovers text from the intermediate shared hidden-unit representation. To generate training data for S2U, MUM, and U2T, two off-line generators trained with a small amount of paired data (100h) are introduced to produce discrete unit sequences for large-scale unpaired speech and text. Experiments are conducted on two typical speech-to-text tasks, ASR and ST, followed by detailed analyses to better understand the proposed method. The contributions of this paper are summarized as follows:
• We propose a unified speech-text pre-training method, SpeechUT, to bridge the speech encoder and the text decoder with hidden units.
• We decouple the speech-to-text model into speech-to-unit and unit-to-text models, to take advantage of large amounts of unpaired speech and text data for pre-training.
• Our proposed SpeechUT achieves state-of-the-art performance on downstream speech recognition and speech translation tasks.
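To make the joint pre-training described above concrete, the following minimal sketch shows how the three objectives could be combined into a single training loss. The weighting coefficients and function names are hypothetical placeholders for illustration, not values or identifiers taken from the paper.

    # Hedged sketch of the joint pre-training objective. s2u_loss, mum_loss, and
    # u2t_loss stand for the three task losses described above; the weights w_*
    # are assumed hyperparameters, not values reported in the paper.
    def speechut_pretrain_loss(s2u_loss, mum_loss, u2t_loss,
                               w_s2u=1.0, w_mum=1.0, w_u2t=1.0):
        # S2U: masked unit prediction from speech input.
        # MUM: masked unit modeling on discrete-unit input.
        # U2T: sequence-to-sequence text recovery from units.
        return w_s2u * s2u_loss + w_mum * mum_loss + w_u2t * u2t_loss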
2 Related Work
The proposed SpeechUT is built upon the Transformer encoder-decoder model (Vaswani et al., 2017) and relates to discrete speech representation learning and joint speech-text pre-training. We discuss these topics in the following.
Discrete Speech Representation Learning
Discretizing continuous speech signals for speech representation learning has drawn substantial attention. Vq-wav2vec (Baevski et al., 2019) and wav2vec 2.0 (Baevski et al., 2020) discretize speech signals into quantized units drawn from a learnable codebook (van den Oord et al., 2017). PBERT (Wang et al., 2022a) instead uses phonemes as the discrete targets in a semi-supervised setting. SemFace (Ren et al., 2021) proposes to use language-independent vector-quantized units as the semantic interface between encoder pre-training and decoder pre-training. Inspired by the masked language model in BERT (Devlin et al., 2019), HuBERT (Hsu et al., 2021) first introduces masked speech prediction of hidden units to pre-train a universal speech model. In particular, the hidden units can be obtained by clustering log Mel-filterbank features or the hidden states of a previously pre-trained model. Recently, some studies explore leveraging discrete hidden units to build speech-to-speech translation systems (Lee et al., 2021a,b), which first convert source speech into target units and then generate the target waveform from the predicted units. In contrast, our goal in this paper is to jointly pre-train speech and text with hidden units as the intermediate bridge.
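To illustrate how such discrete hidden units are typically obtained in the HuBERT style, the sketch below clusters frame-level speech features with k-means and maps each frame to its cluster index. The feature source, the number of clusters, and all names are illustrative assumptions rather than the exact recipe used in this paper.

    # Minimal sketch: derive discrete hidden units by k-means clustering of
    # frame-level speech features (HuBERT-style). The features could be log
    # Mel-filterbanks or hidden states of a previously pre-trained model.
    import numpy as np
    from sklearn.cluster import KMeans

    def learn_unit_codebook(feature_list, n_units=500):
        # feature_list: list of [T_i, D] arrays, one per utterance (assumed layout)
        all_feats = np.concatenate(feature_list, axis=0)
        return KMeans(n_clusters=n_units, n_init=10).fit(all_feats)

    def speech_to_units(features, kmeans):
        # Assign each frame to its nearest centroid, giving a unit ID sequence.
        return kmeans.predict(features)  # shape [T], values in [0, n_units)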
Joint Speech-Text Pre-Training
Single-modal pre-trained models have achieved remarkable results in both natural language processing and spoken language processing, such as BERT (Devlin et al., 2019), UniLM (Dong et al., 2019), XLNet (Yang et al., 2019), wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2021). Thanks to the rapid development of these single-modal pre-training works, researchers have begun to pre-train cross-modal models with both speech and text data (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Ao et al., 2022a; Bapna et al., 2021; Zhang et al., 2022b; Tang et al., 2022). One category of these works focuses on pre-training a unified encoder model for spoken language understanding (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Zhang et al., 2022a). In parallel to our work, SpeechLM (Zhang et al., 2022a) leverages two kinds of tokenizers to tokenize speech and text, and aims at unifying the speech and text modalities in the same semantic space within one encoder model. When fine-tuning an encoder-decoder model, a randomly initialized decoder needs to be added on top of the encoder for speech-to-text tasks (Bapna et al., 2021, 2022). Besides, Maestro (Chen et al., 2022) utilizes paired speech-text data to learn speech-text alignment through a modality-matching algorithm in the RNN-T framework. Our proposed SpeechUT model is most related to encoder-decoder pre-trained models such as SpeechT5 (Ao et al., 2022a) and STPT (Tang et al., 2022), in which speech and text are directly connected by a shared encoder. Unlike them, SpeechUT leverages hidden units (Hsu et al., 2021) as the bridge between the speech encoder and the text decoder, decoupling the conventional model into two pre-trained speech-to-unit and unit-to-text models.
Figure 2: (a) The overall framework of SpeechUT, which is pre-trained jointly with the speech-to-unit (S2U) task, the masked unit modeling (MUM) task, and the unit-to-text (U2T) task. The discrete units are extracted from off-line speech-to-unit (S2U) and text-to-unit (T2U) generators. (b) Fine-tuning is performed for speech-to-text tasks by cascading the speech encoder, the unit encoder, and the text decoder into an end-to-end model.
3 SpeechUT
Figure 2 shows the overall framework of SpeechUT, which leverages the unit representation as the bridge between speech and text. In this section, we will introduce the model architecture, pre-training, and fine-tuning methods.
3.1 Model Architecture
As illustrated in Figure 2(a), SpeechUT mainly contains a speech encoder, a unit encoder, and a text decoder. In addition, a speech pre-net and a unit pre-net pre-process the input waveform and the unit tokens, respectively, into fixed-dimensional hidden states.
Speech/Unit Pre-nets
The speech pre-net is a stack of 1-D convolutional layers with 512 channels and kernel sizes of [10, 3, 3, 3, 3, 2, 2]; the overall downsampling rate is 320. Given a 16 kHz speech waveform, the speech pre-net converts it into a sequence of speech features $X = (x_1, x_2, \dots, x_T)$, where $T$ is the sequence length. The unit pre-net is a simple embedding layer which converts a sequence of unit tokens $Z = (z_1, z_2, \dots, z_L)$ into latent embeddings $U = (u_1, u_2, \dots, u_L)$, where $L$ is the sequence length. The latent embeddings are then equipped with learned positional encodings.
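A minimal sketch of the two pre-nets is given below. The stride pattern [5, 2, 2, 2, 2, 2, 2] is an assumption borrowed from the wav2vec 2.0/HuBERT feature extractor (together with the listed kernel sizes it yields the stated 320x downsampling), and the unit vocabulary size, maximum length, and module names are illustrative.

    # Sketch of the speech and unit pre-nets (assumptions noted in comments).
    import torch
    import torch.nn as nn

    class SpeechPreNet(nn.Module):
        def __init__(self, channels=512,
                     kernels=(10, 3, 3, 3, 3, 2, 2),
                     strides=(5, 2, 2, 2, 2, 2, 2)):    # assumed strides -> 320x downsampling
            super().__init__()
            layers, in_ch = [], 1
            for k, s in zip(kernels, strides):
                layers += [nn.Conv1d(in_ch, channels, kernel_size=k, stride=s), nn.GELU()]
                in_ch = channels
            self.conv = nn.Sequential(*layers)

        def forward(self, waveform):                     # waveform: [B, num_samples] at 16 kHz
            feats = self.conv(waveform.unsqueeze(1))     # [B, 512, T]
            return feats.transpose(1, 2)                 # X = (x_1, ..., x_T): [B, T, 512]

    class UnitPreNet(nn.Module):
        def __init__(self, num_units=1000, dim=512, max_len=4096):  # sizes assumed
            super().__init__()
            self.embed = nn.Embedding(num_units, dim)
            self.pos = nn.Embedding(max_len, dim)        # learned positional encodings

        def forward(self, units):                        # units Z: [B, L] integer unit IDs
            positions = torch.arange(units.size(1), device=units.device)
            return self.embed(units) + self.pos(positions)  # U = (u_1, ..., u_L): [B, L, dim]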
Speech Encoder
The speech encoder is a stack of Transformer layers (Vaswani et al., 2017) that transforms the local speech features $X$ into contextualized speech hidden states $H = (h_1, h_2, \dots, h_T)$.
Unit Encoder
The unit encoder has the same architecture and number of layers as the speech encoder. It is designed to align the speech hidden states $H$ and the unit embeddings $U$ in the same latent space. The unit encoder takes two types of input, $H$ and $U$, and outputs high-level contextualized representations $C^s = (c^s_1, c^s_2, \dots, c^s_T)$ and $C^u = (c^u_1, c^u_2, \dots, c^u_L)$, respectively.
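The shared unit encoder can be sketched as a single Transformer stack applied to either input type; the layer count and model width below are assumptions for illustration, not the paper's configuration.

    # Sketch of the shared unit encoder: one Transformer stack, two input types.
    import torch.nn as nn

    class UnitEncoder(nn.Module):
        def __init__(self, dim=768, num_layers=6, heads=12):   # sizes assumed
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, x):
            # x is either H (speech hidden states, [B, T, dim]) or
            # U (unit embeddings, [B, L, dim]); the output is C^s or C^u, respectively.
            return self.encoder(x)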
Text Decoder
The text decoder is a Transformer decoder (Vaswani et al., 2017) consisting of a text embedding layer, stacked Transformer layers, and a text output layer. It is used to generate the target text sequence $Y = (y_1, y_2, \dots, y_{|Y|})$ from left to right according to the output of the unit encoder.
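Putting the pieces together, the fine-tuning path of Figure 2(b) cascades the speech pre-net, speech encoder, unit encoder, and text decoder into one end-to-end speech-to-text model. The sketch below assumes the modules sketched earlier, a hypothetical linear projection from the 512-dimensional pre-net features to the encoder width, and teacher-forced decoding; it is an illustration, not the released implementation.

    # Sketch of the fine-tuning cascade for ASR/ST (Figure 2(b)). The speech_encoder
    # and text_decoder passed in are assumed to be standard Transformer stacks;
    # all names are illustrative.
    import torch.nn as nn

    class SpeechUTForSpeechToText(nn.Module):
        def __init__(self, speech_prenet, speech_encoder, unit_encoder, text_decoder,
                     prenet_dim=512, model_dim=768):
            super().__init__()
            self.speech_prenet = speech_prenet
            self.proj = nn.Linear(prenet_dim, model_dim)   # assumed projection to encoder width
            self.speech_encoder = speech_encoder
            self.unit_encoder = unit_encoder               # shared with pre-training
            self.text_decoder = text_decoder               # initialized from U2T pre-training

        def forward(self, waveform, prev_text_tokens):
            x = self.proj(self.speech_prenet(waveform))    # X: [B, T, model_dim]
            h = self.speech_encoder(x)                     # H
            c = self.unit_encoder(h)                       # C^s
            # The decoder attends to C^s and predicts the next text token at each step.
            return self.text_decoder(prev_text_tokens, c)  # logits over the text vocabulary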
3.2 Pre-Training Tasks
To pre-train the components of SpeechUT, we propose three pre-training tasks:

Speech-to-Unit (S2U) Task
The speech-to-unit objective is similar to that of HuBERT (Hsu et al., 2021): the model predicts the units at masked positions based on the unmasked regions of a speech sequence. In particular, SpeechUT enables this prediction for both the output of the speech encoder ($H$) and the output of the unit encoder ($C^s$).
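As a rough illustration of this masked-prediction objective, the sketch below computes a cross-entropy loss over unit targets at masked positions, given encoder outputs and a unit-prediction head. The masking strategy, the prediction head, and all names are assumptions in the spirit of HuBERT-style masked prediction, not the paper's exact formulation.

    # Hedged sketch of masked unit prediction (S2U-style). `hidden` can be the speech
    # encoder output H or the unit encoder output C^s; `unit_targets` are the
    # frame-aligned discrete units; `mask` marks the masked positions.
    import torch.nn.functional as F

    def masked_unit_prediction_loss(hidden, unit_targets, mask, unit_head):
        # hidden:       [B, T, D]  encoder outputs
        # unit_targets: [B, T]     integer unit IDs from the off-line S2U generator
        # mask:         [B, T]     boolean, True where the input frames were masked
        # unit_head:    nn.Linear(D, num_units) projecting states to unit logits (assumed)
        logits = unit_head(hidden)                        # [B, T, num_units]
        return F.cross_entropy(logits[mask], unit_targets[mask])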