(Ao et al., 2022; Hsu et al., 2021; Wu et al., 2022; Nguyen et al., 2022), speech resynthesis (Polyak et al., 2021), spoken language generation (Lakhotia et al., 2021; Kharitonov et al., 2022b), speech-to-speech translation (Lee et al., 2021; Popuri et al., 2022; Lee et al., 2022; Wu et al., 2022), and spoken named entity recognition (Wu et al., 2022). Following these prior studies, our work also falls in the domain of ‘textless NLP’.
Unsupervised sentence embeddings
Learning semantic sentence embeddings has been extensively studied in the NLP community, with notable examples including Skip-Thought vectors (Kiros et al., 2015), InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and SBERT (Reimers and Gurevych, 2019). Recently, unsupervised sentence embedding methods have considerably narrowed the performance gap with their supervised counterparts. Contrastive learning has been utilized to learn a vector space in which semantically similar sentences are close to each other, as in DeCLUTR (Giorgi et al., 2021), SimCSE (Gao et al., 2021), Trans-Encoder (Liu et al., 2021), and DiffCSE (Chuang et al., 2022). Another approach relies on autoencoders to compress a sentence into a latent vector representation and then reconstruct the original sentence, as in VGVAE (Chen et al., 2019) and TSDAE (Wang et al., 2021).
Most unsupervised methods are based on textual sentences. In speech, current studies tend to center on acoustic word embeddings (e.g., Kamper et al., 2016; Settle and Livescu, 2016; Settle et al., 2017; Holzenberger et al., 2018; Kamper, 2019). Despite this progress, learning sentence-level embeddings for speech still remains under-explored. In the SUPERB benchmark for evaluating speech representations (Yang et al., 2021), spoken sentence similarity ranking is not yet listed as a downstream task. Recent work has shown that spoken sentence semantic similarities can be learned via visually grounded speech models (Merkx et al., 2021). Multilingual spoken sentence embeddings can also be learned by using supervised multilingual text models as teacher models (Duquenne et al., 2021; Khurana et al., 2022). These methods rely, to varying degrees, on labeled data such as speech-image pairs or multilingual sentence pairs. In contrast, we propose unsupervised methods to induce semantic embeddings from speech signals alone, and our methods can also exploit textual transcriptions to improve performance when they are available.
3 Method
Task formulation
The current task is to encode spoken utterances into low-dimensional dense vectors such that semantically similar utterances are close to each other in the learned latent space. Given a speech signal $\mathbf{x} \in \mathbb{R}^{1 \times N} = [x_1, x_2, \ldots, x_N]$, our goal is to learn a neural network function $f_{\mathrm{enc}}$ that converts $\mathbf{x}$ to a fixed-dimensional vector $\mathbf{z} \in \mathbb{R}^{d} = f_{\mathrm{enc}}(\mathbf{x})$, such that $\mathbf{z}$ encodes the semantic content of the original signal $\mathbf{x}$. For a semantically similar pair $\{\mathbf{z}, \mathbf{z}^{+}\}$ and a semantically dissimilar pair $\{\mathbf{z}, \mathbf{z}^{-}\}$ (as determined by human raters), it is expected that $\mathrm{sim}(\mathbf{z}, \mathbf{z}^{+}) > \mathrm{sim}(\mathbf{z}, \mathbf{z}^{-})$, where $\mathrm{sim}(\cdot, \cdot)$ is a similarity scoring function.
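To make the notation concrete, the following is a minimal sketch assuming a toy stand-in encoder and cosine similarity as $\mathrm{sim}(\cdot, \cdot)$; the architecture, embedding dimensionality, and frame sizes are illustrative assumptions, not the model used in this work.

```python
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Stand-in for f_enc: frames the raw signal, projects each frame,
    and mean-pools the frame vectors into one d-dimensional embedding z."""
    def __init__(self, d: int = 256, frame: int = 400, hop: int = 160):
        super().__init__()
        self.proj = torch.nn.Conv1d(1, d, kernel_size=frame, stride=hop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: speech signal of shape (1, N)
        h = self.proj(x.unsqueeze(0))      # (1, d, num_frames)
        return h.mean(dim=-1).squeeze(0)   # (d,)

def sim(z1: torch.Tensor, z2: torch.Tensor) -> float:
    """Cosine similarity used as the scoring function sim(., .)."""
    return F.cosine_similarity(z1, z2, dim=0).item()

# Random waveforms stand in for a query, a similar, and a dissimilar utterance.
f_enc = ToyEncoder()
x, x_pos, x_neg = (torch.randn(1, 16000) for _ in range(3))
z, z_pos, z_neg = f_enc(x), f_enc(x_pos), f_enc(x_neg)
# A well-trained encoder should satisfy sim(z, z_pos) > sim(z, z_neg)
# for pairs judged similar/dissimilar by human raters.
print(sim(z, z_pos), sim(z, z_neg))
```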
It is further assumed that some form of transcription of the original speech signal exists. Usually, a transcription of $\mathbf{x}$ takes the form of a textual sequence $\mathbf{y} \in \mathbb{R}^{1 \times M} = [y_1, y_2, \ldots, y_M]$, with $N > M$. Such data are sometimes available, as most speech datasets for ASR and TTS are organized as pairs of speech and text. However, in most scenarios textual transcriptions are not available or are too costly to create. In these cases, the transcriptions can take the form of pseudo-units $\hat{\mathbf{y}} \in \mathbb{R}^{1 \times L} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$, with $N > L$, which can be generated by an unsupervised system for acoustic unit discovery. During training, these transcripts are used as the targets for the proxy tasks. At inference time, however, the pretrained model can directly project speech into semantic embeddings.
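As an illustration of the two forms of supervision described above, the following sketch shows how a training example might be represented; the field names and toy values are hypothetical, not the data format used in this work.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class TrainingExample:
    x: Sequence[float]                       # raw speech samples, length N
    y: Optional[Sequence[str]] = None        # textual transcription, length M < N (if available)
    y_hat: Optional[Sequence[int]] = None    # discovered pseudo-units, length L < N (otherwise)

# With paired text (e.g., from an ASR/TTS corpus):
ex_text = TrainingExample(x=[0.01, -0.02, 0.03], y=["the", "cat", "sat"])
# Without text, pseudo-units from unsupervised acoustic unit discovery serve as proxy targets:
ex_units = TrainingExample(x=[0.01, -0.02, 0.03], y_hat=[17, 42, 42, 5])
# At inference time only x is needed: the trained encoder maps x directly to z.
```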
3.1 Discretizing speech signals
Acoustic unit discovery refers to the task of segmenting speech signals into discrete word-like or phone-like units (e.g., Lee and Glass, 2012; Lee et al., 2015; Ondel et al., 2016; Kamper, 2019; van Niekerk et al., 2020). Annotating speech signals can sometimes be prohibitively costly for many languages and application domains. Unsupervised discovery of acoustic units can serve as a proxy for transcriptions when training speech systems, provided that the discovered acoustic units are consistent representations of speech. In our approach, acoustic units are treated as ‘pseudo-texts’ to bootstrap the learning of semantic representations; a minimal sketch of such a discretization pipeline is given below.
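The sketch referenced above shows one way such discretization can be done, assuming frame-level features from a pretrained HuBERT encoder (as described next) clustered with k-means; the layer index, number of clusters, and file paths are illustrative assumptions rather than the exact pipeline used in this work.

```python
# Discretize speech into pseudo-units: extract frame-level features from a
# pretrained encoder, cluster them with k-means, and collapse repeated
# cluster IDs into a unit sequence.
import itertools
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE          # pretrained HuBERT encoder
model = bundle.get_model().eval()

def frame_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return hidden states of the chosen transformer layer, one vector per frame."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)                     # (num_frames, feature_dim)

# Fit k-means on features pooled from a (hypothetical) list of training files.
train_feats = torch.cat([frame_features(p) for p in ["a.wav", "b.wav"]]).numpy()
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_feats)

def to_pseudo_units(wav_path: str) -> list[int]:
    """Map each frame to its cluster ID and merge consecutive duplicates."""
    ids = kmeans.predict(frame_features(wav_path).numpy())
    return [k for k, _ in itertools.groupby(ids.tolist())]
```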
We used a pretrained speech transformer, HuBERT, to discretize speech signals into ‘hidden units’, which were proposed in Baevski et al. (2021) and Lakhotia et al. (2021). After passing speech signals into HuBERT, the hidden states of the sixth