a speech-to-unit model and a unit-to-text model,
which can be pre-trained with unpaired speech and
text data, respectively, as shown in Figure 1.
In this paper, we propose a unified speech-unit-text pre-training method (SpeechUT), using hidden-unit representations as a bridge between the speech encoder and the text decoder. SpeechUT
leverages three unsupervised pre-training tasks: a speech-to-unit (S2U) task to model the mapping between speech and units, as in HuBERT, a masked unit modeling (MUM) task to learn better unit representations, and a unit-to-text (U2T) task to recover text from the shared hidden-unit representation. To generate training data for S2U, MUM,
and U2T, two off-line generators trained with a
small amount of paired data (100h) are introduced
to produce discrete unit sequences for large-scale
unpaired speech and text. Experiments are conducted on two typical speech-to-text tasks, ASR and ST, followed by a detailed analysis to better understand the proposed method. The contributions of this paper are summarized as follows:
• We propose a unified speech-text pre-training method, SpeechUT, to bridge the speech encoder and the text decoder with hidden units.
• We decouple the speech-to-text model into speech-to-unit and unit-to-text models, to take advantage of a large amount of unpaired speech and text data for pre-training.
• Our proposed SpeechUT achieves state-of-the-art performance in downstream speech recognition and speech translation tasks.
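To make the multi-task setup concrete, the three pre-training losses can be viewed as being optimized jointly. The weighted-sum form and the weights below are an illustrative assumption, not the exact formulation of the method:

\mathcal{L}_{\mathrm{pre}} = \lambda_{\mathrm{S2U}} \mathcal{L}_{\mathrm{S2U}} + \lambda_{\mathrm{MUM}} \mathcal{L}_{\mathrm{MUM}} + \lambda_{\mathrm{U2T}} \mathcal{L}_{\mathrm{U2T}},

where \mathcal{L}_{\mathrm{S2U}} is computed from unpaired speech and its generated units, \mathcal{L}_{\mathrm{MUM}} from masked unit sequences, and \mathcal{L}_{\mathrm{U2T}} from unit sequences generated from unpaired text.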
2 Related Work
The proposed SpeechUT is built upon the Trans-
former encoder-decoder model (Vaswani et al.,
2017) and relates to discrete speech representa-
tion learning and joint speech-text pre-training. We discuss both topics below.
Discrete Speech Representation Learning
Discretizing continuous speech signals for speech representation learning has drawn substantial attention.
Vq-wav2vec (Baevski et al., 2019) and wav2vec 2.0 (Baevski et al., 2020) attempt to discretize speech signals into quantized units from a learnable codebook (van den Oord et al., 2017). PBERT (Wang et al., 2022a) instead uses phonemes as the discrete targets in a semi-supervised setting. SemFace (Ren et al., 2021) proposes to use language-independent vector quantized units as the semantic interface between encoder pre-training and decoder pre-training.
Inspired by the masked language model in BERT (Devlin et al., 2019), HuBERT (Hsu et al., 2021) first introduces masked speech prediction of hidden units to pre-train a universal speech model. In particular, the hidden units can be clustered from log Mel-filterbank features or from the hidden states of a previously pre-trained model. Recently, some
studies have explored leveraging discrete hidden units to build speech-to-speech translation systems (Lee et al., 2021a,b), which first convert source speech into target units and then generate the target waveform from the predicted units. In contrast, our goal in this paper is to jointly pre-train speech and text with hidden units as the intermediate bridge.
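As a minimal sketch of how such hidden units are typically derived in HuBERT-style pipelines, frame-level features can be clustered with k-means and each frame mapped to its nearest cluster id. The feature choice, the number of clusters (500 here), and the function names are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_unit_codebook(frame_features: np.ndarray, num_units: int = 500) -> MiniBatchKMeans:
    # frame_features: (num_frames, feature_dim), e.g. log Mel-filterbank features
    # or hidden states of a previously pre-trained model.
    kmeans = MiniBatchKMeans(n_clusters=num_units, batch_size=10000)
    kmeans.fit(frame_features)
    return kmeans

def speech_to_units(frame_features: np.ndarray, kmeans: MiniBatchKMeans) -> list:
    # Assign each frame to its nearest cluster, yielding a discrete unit sequence.
    unit_ids = kmeans.predict(frame_features).tolist()
    # Merging consecutive duplicate units is a common, optional post-processing step.
    return [u for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]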
Joint Speech-Text Pre-Training
Single-modal
pre-trained models have achieved remarkable re-
sults in both natural language processing and spo-
ken language processing, such as BERT (Devlin et al., 2019), UniLM (Dong et al., 2019), XLNet (Yang et al., 2019), wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2021). Thanks to the rapid devel-
opment of these single-modal pre-training works, researchers have begun to pre-train cross-modal models with both speech and text data (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Ao et al., 2022a; Bapna et al., 2021; Zhang et al., 2022b; Tang et al., 2022). One category of these works focuses on pre-training a unified encoder model for spoken language understanding (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Zhang et al., 2022a).
In parallel to our work, SpeechLM (Zhang et al.,
2022a) leverages two kinds of tokenizers to tok-
enize speech and text, and aims at unifying speech
and text modalities into the same semantic space
within one encoder model. When fine-tuning such encoder-only pre-trained models for speech-to-text tasks, a randomly initialized decoder needs to be stacked on top of the encoder (Bapna et al., 2021, 2022).
In addition, Maestro (Chen et al., 2022) utilizes paired speech-text data to learn speech-text alignment through a modality-matching algorithm within the RNN-T framework. Our proposed SpeechUT model is most related to encoder-decoder pre-trained models like SpeechT5 (Ao et al., 2022a) and STPT (Tang et al., 2022), in which speech and text are directly connected by a shared encoder. Unlike them, SpeechUT leverages hidden units (Hsu et al., 2021) as the bridge between the speech encoder and the