a speech-to-unit model and a unit-to-text model,
which can be pre-trained with unpaired speech and
text data, respectively, as shown in Figure 1.
In this paper, we propose a unified speech-unit-text pre-training method (SpeechUT), using hidden-unit representations as a bridge between the speech encoder and the text decoder. SpeechUT
leverages three unsupervised pre-training tasks: a speech-to-unit (S2U) task to model the mapping between speech and units, as in HuBERT, a masked unit modeling (MUM) task to learn better unit representations, and a unit-to-text (U2T) task to recover text from the shared hidden-unit representation. To generate training data for S2U, MUM,
and U2T, two off-line generators trained with a
small amount of paired data (100h) are introduced
to produce discrete unit sequences for large-scale
unpaired speech and text. Experiments are conducted on two typical speech-to-text tasks, ASR and ST, followed by a detailed analysis to better understand the proposed method. The contributions of this paper are summarized as follows:
• We propose a unified speech-text pre-training method, SpeechUT, to bridge the speech encoder and the text decoder with hidden units.
• We decouple the speech-to-text model into speech-to-unit and unit-to-text models, to take advantage of a large amount of unpaired speech and text data for pre-training.
• Our proposed SpeechUT achieves state-of-the-art performance in downstream speech recognition and speech translation tasks.
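To make the multi-task setup concrete, the three pre-training losses can be viewed as being optimized jointly. The weighted-sum form and the weights below are an illustrative assumption, not the exact formulation of the method:

\mathcal{L}_{\mathrm{pre}} = \lambda_{\mathrm{S2U}} \mathcal{L}_{\mathrm{S2U}} + \lambda_{\mathrm{MUM}} \mathcal{L}_{\mathrm{MUM}} + \lambda_{\mathrm{U2T}} \mathcal{L}_{\mathrm{U2T}},

where \mathcal{L}_{\mathrm{S2U}} is computed from unpaired speech and its generated units, \mathcal{L}_{\mathrm{MUM}} from masked unit sequences, and \mathcal{L}_{\mathrm{U2T}} from unit sequences generated from unpaired text.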
2 Related Work
The proposed SpeechUT is built upon the Trans-
former encoder-decoder model (Vaswani et al.,
2017) and relates to discrete speech representa-
tion learning and joint speech-text pre-training. We discuss both topics below.
Discrete Speech Representation Learning
Discretizing continuous speech signals for speech representation learning has drawn substantial attention.
Vq-wav2vec (Baevski et al., 2019) and wav2vec 2.0 (Baevski et al., 2020) attempt to discretize speech signals into quantized units from a learnable codebook (van den Oord et al., 2017). PBERT (Wang et al., 2022a) instead uses phonemes as the discrete targets in a semi-supervised setting. SemFace (Ren et al., 2021) proposes to use language-independent vector quantized units as the semantic interface between encoder pre-training and decoder pre-training.
Inspired by the masked language model in BERT (Devlin et al., 2019), HuBERT (Hsu et al., 2021) first introduces masked speech prediction of hidden units to pre-train a universal speech model. In particular, the hidden units can be clustered from log Mel-filterbank features or from the hidden states of a previously pre-trained model. Recently, some
studies have explored leveraging discrete hidden units to build speech-to-speech translation systems (Lee et al., 2021a,b), which first convert source speech into target units and then generate the target waveform from the predicted units. In contrast, our goal in this paper is to jointly pre-train speech and text with hidden units as the intermediate bridge.
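As a minimal sketch of how such hidden units are typically derived in HuBERT-style pipelines, frame-level features can be clustered with k-means and each frame mapped to its nearest cluster id. The feature choice, the number of clusters (500 here), and the function names are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_unit_codebook(frame_features: np.ndarray, num_units: int = 500) -> MiniBatchKMeans:
    # frame_features: (num_frames, feature_dim), e.g. log Mel-filterbank features
    # or hidden states of a previously pre-trained model.
    kmeans = MiniBatchKMeans(n_clusters=num_units, batch_size=10000)
    kmeans.fit(frame_features)
    return kmeans

def speech_to_units(frame_features: np.ndarray, kmeans: MiniBatchKMeans) -> list:
    # Assign each frame to its nearest cluster, yielding a discrete unit sequence.
    unit_ids = kmeans.predict(frame_features).tolist()
    # Merging consecutive duplicate units is a common, optional post-processing step.
    return [u for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]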
Joint Speech-Text Pre-Training
Single-modal
pre-trained models have achieved remarkable re-
sults in both natural language processing and spo-
ken language processing, such as BERT (Devlin et al., 2019), UniLM (Dong et al., 2019), XLNet (Yang et al., 2019), wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2021). Thanks to the rapid devel-
opment of these single-modal pre-training works, researchers have begun to pre-train cross-modal models with both speech and text data (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Ao et al., 2022a; Bapna et al., 2021; Zhang et al., 2022b; Tang et al., 2022). One category of these works focuses on pre-training a unified encoder model for spoken language understanding (Chung et al., 2021b; Kim et al., 2021; Qian et al., 2021; Zhang et al., 2022a).
In parallel to our work, SpeechLM (Zhang et al.,
2022a) leverages two kinds of tokenizers to tok-
enize speech and text, and aims at unifying speech
and text modalities into the same semantic space
within one encoder model. When fine-tuning such encoder-only pre-trained models for speech-to-text tasks, a randomly initialized decoder needs to be stacked on top of the encoder (Bapna et al., 2021, 2022).
In addition, Maestro (Chen et al., 2022) utilizes paired speech-text data to learn speech-text alignment through a modality-matching algorithm within the RNN-T framework. Our proposed SpeechUT model is most related to encoder-decoder pre-trained models like SpeechT5 (Ao et al., 2022a) and STPT (Tang et al., 2022), in which speech and text are directly connected by a shared encoder. Unlike them, SpeechUT leverages hidden units (Hsu et al., 2021) as the bridge between the speech encoder and the