Bootstrapping meaning through listening: Unsupervised learning of
spoken sentence embeddings
Jian Zhu¹,², Zuoyu Tian³, Yadong Liu², Cong Zhang⁴, Chia-wen Lo⁵
¹University of Michigan, Ann Arbor  ²University of British Columbia
³Indiana University Bloomington  ⁴Newcastle University
⁵Max Planck Institute for Human Cognitive and Brain Sciences
¹lingjzhu@umich.edu, ³zuoytian@iu.edu
Abstract
Inducing semantic representations directly
from speech signals is a highly challeng-
ing task but has many useful applications in
speech mining and spoken language under-
standing. This study tackles the unsupervised
learning of semantic representations for spo-
ken utterances. Through converting speech
signals into hidden units generated from acous-
tic unit discovery, we propose WavEmbed, a
multimodal sequential autoencoder that pre-
dicts hidden units from a dense representa-
tion of speech. Secondly, we also propose S-
HuBERT to induce meaning through knowl-
edge distillation, in which a sentence embed-
ding model is first trained on hidden units
and passes its knowledge to a speech encoder
through contrastive learning. The best per-
forming model achieves a moderate correlation (0.5–0.6) with human judgments, without
relying on any labels or transcriptions. Further-
more, these models can also be easily extended
to leverage textual transcriptions of speech to
learn much better speech embeddings that are
strongly correlated with human annotations.
Our proposed methods are applicable to the
development of purely data-driven systems for
speech mining, indexing and search.
1 Introduction
In Spoken Language Understanding (SLU), a goal
is to understand the semantic content of spoken
utterances. Traditionally, research in speech processing has focused on tasks that process the low-level sensory information in speech, such as automatic speech recognition (ASR), under the assumption that language understanding can be handled by NLP modules after speech is transcribed (Wang et al., 2005; De Mori et al., 2008; Serdyuk et al., 2018). Yet speech-based semantic representations allow us to bypass texts in some scenarios, not only simplifying the pipeline but also benefiting domains without much transcribed data and languages without writing systems.
For speech processing, the spoken term detec-
tion tasks such as keyword detection (e.g., Mamou
et al.,2007;Miller et al.,2007;Can and Saraclar,
2011;Wang et al.,2018) and query-by-example
search (e.g., Hazen et al.,2009;Parada et al.,2009;
Chen et al.,2015) focus on the exact matching of
audio terms in speech databases. Yet speech-to-speech search enabled by spoken sentence embeddings can further expand our capacity to search speech by meaning rather than only by form. This marks a significant advancement in machines' ability to perform speech mining, voice search and indexing, and spoken information retrieval (Duquenne et al., 2021).
While learning textual sentence similarity is a
classic task in NLP (e.g., Agirre et al.,2012,2015,
2016;Cer et al.,2017), the task is still relatively
unexplored in speech research. The main challenge
in learning spoken sentence embeddings lies in the
lack of labeled data for supervised learning. Given
the costs associated with creating semantic ratings,
it is important to explore unsupervised methods
to induce semantic representations directly from
speech signals.
In this study, we present two approaches to
tackle the challenge of inducing semantic repre-
sentations directly from speech signals without any
semantic labeling. The first model, Waveform Em-
bedding Transformer (WavEmbed) (Figure 1), is
a multimodal sequential autoencoder that encodes
a speech signal into a bottleneck vector and re-
constructs a sequence of ‘hidden units’, which are
generated using unsupervised acoustic unit dis-
covery. The second model, Sentence HuBERT
(S-HuBERT), learns the semantic representation
through aligning with a frozen unsupervised text
embedding model, which is trained with the hidden
units (Figure 2). We make the following contributions:
Figure 1: The architecture of WavEmbed. WavEmbed first projects a speech signal into a fixed-dimensional vector
representation, and then decodes it back to discrete acoustic units, which are generated through clustering on the
hidden states from the sixth layer of the (frozen) pretrained HuBERT model. The learned fixed-dimensional vector
encodes semantic information in the latent space. No texts are required in this training loop. However, if text
transcripts are available, the decoder targets can also be textual sequences.
Figure 2: Illustration of S-HuBERT. An unsupervised
text embedding model is first trained on either hidden
units or textual transcripts. Then it is used as a teacher
model to transfer semantic knowledge to a speech en-
coder through contrastive model distillation.
• We propose simple yet effective unsupervised methods to learn spoken sentence representations. Our best performing unsupervised model achieves moderate Spearman's rank correlations (0.5–0.6) with human judgements without relying on any labels or text transcriptions.
• Our proposed methods can be easily extended to speech-text pairs to enhance performance. With text transcriptions, the performance can be further increased to 0.7–0.8 in terms of Spearman's correlation. We made extensive comparisons and analyses of model performance under different conditions.
• We have also created a speech dataset for evaluating spoken sentence similarity, rated by multiple human raters and encompassing various speech accents, to measure the robustness of models.
Our code, data and pretrained checkpoints necessary for replicating the experiments are available at https://github.com/lingjzhu/spoken_sent_embedding.
2 Background
Self-supervised speech modeling
Most speech
technologies including ASR and text-to-speech syn-
thesis (TTS) nowadays heavily rely on the availabil-
ity of text transcripts. Yet such textual resources
can sometimes be hard to collect for many lan-
guages, some of which might not have writing
systems. Many efforts have since been made to
explore effective methods to learn speech repre-
sentations directly from speech signals, such as
the ZeroSpeech Workshop (Versteegh et al.,2015;
Dunbar et al.,2017,2019,2020,2021).
Recently, large-scale self-supervised models in-
cluding CPC (Oord et al.,2018), Wav2Vec (Schnei-
der et al.,2019), Wav2Vec2 (Baevski et al.,2020),
HuBERT (Hsu et al.,2021) and WavLM (Chen
et al.,2021) have learned effective speech repre-
sentations that can benefit a wide range of down-
stream speech tasks (Yang et al.,2021). In par-
ticular, Hsu et al. (2021) proposed using a clustering algorithm to cluster the hidden states of HuBERT into hidden units, which were then used as prediction targets for masked frames during training. These clusters were shown to encode rich phonemic information (Hsu et al., 2021; Baevski et al., 2021). It was later found that discretizing speech into 'hidden units' allows the application of NLP algorithms to process speech via the proxy of these discrete units, without the need for actual textual transcriptions ('textless NLP') (Lakhotia et al., 2021; Nguyen et al., 2022).
This discovery has greatly benefited a variety of
tasks, some of which were traditionally not per-
formed with speech, including unsupervised ASR
(Baevski et al.,2021), spoken language modeling
(Ao et al.,2022;Hsu et al.,2021;Wu et al.,2022;
Nguyen et al.,2022), speech resynthesis (Polyak
et al.,2021), spoken language generation (Lakhotia
et al.,2021;Kharitonov et al.,2022b), speech-to-
speech translation (Lee et al.,2021;Popuri et al.,
2022;Lee et al.,2022;Wu et al.,2022), and spoken
named entity recognition (Wu et al.,2022). Follow-
ing these prior studies, our work also falls in the
domain of ‘textless NLP’.
Unsupervised sentence embeddings
Learning semantic sentence embeddings has been extensively studied in the NLP community, with approaches such as Skip-Thought vectors (Kiros et al., 2015), InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and SBERT (Reimers and Gurevych,
2019). Recently, unsupervised sentence embed-
dings have considerably narrowed the performance
gap between unsupervised and supervised methods.
Contrastive learning has been utilized to learn a vec-
tor space in which semantically similar sentences
are close to each other, such as DeCLUTR (Giorgi
et al.,2021), SimCSE (Gao et al.,2021), TransEn-
coder (Liu et al.,2021) and DiffCSE (Chuang et al.,
2022). Another approach relies on autoencoders to
compress a sentence into a latent vector represen-
tation and then reconstruct the original sentence,
such as VGVAE (Chen et al.,2019) and TSDAE
(Wang et al.,2021).
Most unsupervised methods are based on tex-
tual sentences. In speech, current studies tend to
center on acoustic word embeddings (e.g., Kamper
et al.,2016;Settle and Livescu,2016;Settle et al.,
2017;Holzenberger et al.,2018;Kamper,2019).
Despite this progress, learning sentence-level embeddings for speech still remains under-explored. In the SUPERB benchmark for evaluating speech representations (Yang et al., 2021), spoken sentence similarity ranking is not yet listed as a downstream task. Recent work has shown that spoken sentence semantic similarities can be learned via visually grounded speech models (Merkx et al., 2021). Multilingual spoken sentence embeddings can also be learned by using supervised multilingual text models as teacher models (Duquenne et al., 2021; Khurana et al., 2022). These methods more or less rely on labeled data such as speech-image pairs or multilingual sentence pairs. In contrast, we propose unsupervised methods to induce semantic embeddings from speech signals only, and our methods can also utilize textual transcriptions to improve performance when they are available.
3 Method
Task formulation
The current task is to encode spoken utterances into low-dimensional dense vectors such that semantically similar utterances are close to each other in the learned latent space. Given a speech signal $x \in \mathbb{R}^{1 \times N} = [x_1, x_2, \dots, x_N]$, our goal is to learn a neural network function $f_{enc}$ that converts $x$ to a fixed-dimensional vector $z = f_{enc}(x) \in \mathbb{R}^{d}$, such that $z$ encodes the semantic content of the original signal $x$. For a semantically similar pair $\{z, z^{+}\}$ and a semantically dissimilar pair $\{z, z^{-}\}$ (as determined by human raters), it is expected that $\mathrm{sim}(z, z^{+}) > \mathrm{sim}(z, z^{-})$, where $\mathrm{sim}(\cdot)$ is a similarity scoring function.
It is further assumed that some form of transcription of the original speech signal exists. Usually, a transcription of $x$ takes the form of a textual sequence $y \in \mathbb{R}^{1 \times M} = [y_1, y_2, \dots, y_M]$, with $N > M$. Such data are sometimes available, as most speech datasets for ASR and TTS are organized as pairs of speech and text. However, in most scenarios textual transcriptions are not available or are too costly to create. In these cases, the transcriptions can take the form of pseudo-units $\hat{y} \in \mathbb{R}^{1 \times L} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_L]$, with $N > L$, which can be generated by an unsupervised system for acoustic unit discovery. During training, these transcripts are used as the targets for the proxy tasks. At inference time, however, the pretrained model directly projects speech into semantic embeddings.
3.1 Discretizing speech signals
Acoustic unit discovery refers to the task of seg-
menting speech signals into discrete word-like or
phone-like units (e.g., Lee and Glass,2012;Lee
et al.,2015;Ondel et al.,2016;Kamper,2019;van
Niekerk et al.,2020). Annotating speech signals
can sometimes be prohibitively costly for many
languages and application domains. Unsupervised
discovery of acoustic units can be used as a proxy
of transcriptions to train speech systems, if the
discovered acoustic units are consistent representa-
tions of speech. In our approach, acoustic units are
treated as 'pseudo-texts' to bootstrap the learning
of semantic representations.
We used a pretrained speech transformer, HuBERT, to discretize speech signals into 'hidden units', as proposed in Baevski et al. (2021) and Lakhotia et al. (2021). After passing speech signals into HuBERT, the hidden states of the sixth layer were extracted and a k-means clustering algorithm was applied to quantize them into discrete clusters. The sequence of cluster indexes, after deduplication by merging consecutive identical indexes, constitutes the hidden units representing the original speech (see Figure 1). The discrete hidden units remove certain paralinguistic and non-linguistic variations such as speaker voice traits and background noise, so they can be considered a normalized representation of the speech content (though many phonetic variations are still present) (Lee et al., 2021).
We used textless-lib (Kharitonov et al., 2022a) to convert speech signals into discrete hidden units. We selected hubert-base-ls960 as the base speech encoder and set the number of clusters to 50, 100 and 200. After speech was discretized into sequences of hidden units, sentence-piece tokenizers (Kudo and Richardson, 2018) were trained on them to shorten the sequence length (see Appendix A). There is evidence that re-tokenizing hidden units is generally beneficial for language modeling and downstream tasks (Ren et al., 2022; Wu et al., 2022).
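For concreteness, below is a minimal sketch of this discretization step using the Hugging Face transformers implementation of HuBERT and scikit-learn k-means; the paper itself uses textless-lib, so the checkpoint, layer index, cluster count and file names here are illustrative assumptions rather than the exact pipeline.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Illustrative stand-in for the textless-lib pipeline used in the paper.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer6_features(wav_path: str) -> torch.Tensor:
    """Return frame-level hidden states from the sixth HuBERT layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[6].squeeze(0)  # (num_frames, 768)

# Fit k-means on frames pooled from the unlabeled corpus (file names are hypothetical).
kmeans = MiniBatchKMeans(n_clusters=100)
corpus = ["utt1.wav", "utt2.wav"]
kmeans.fit(torch.cat([layer6_features(p) for p in corpus]).numpy())

def to_hidden_units(wav_path: str) -> list[int]:
    """Quantize frames to cluster indexes and merge consecutive duplicates."""
    ids = kmeans.predict(layer6_features(wav_path).numpy()).tolist()
    return [u for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]
```

The deduplicated index sequences returned by `to_hidden_units` play the role of the 'pseudo-texts' on which the tokenizers and language models described below are trained.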
3.2 S-HuBERT
The first approach, S-HuBERT, is to transfer the
knowledge of a well-learned text embedding model
to a speech embedding model (Duquenne et al.,
2021;Khurana et al.,2022), in which pretrained
supervised textual embeddings are adopted as the
teacher models and speech models are trained to
align with the text embeddings in the same latent
space.
Here we also extend this approach to the unsupervised learning domain. The proposed method first trains an unsupervised sentence embedding model on transcriptions, and then transfers its knowledge to an acoustic sentence embedding model (S-HuBERT) by leveraging the correspondence between speech and its transcriptions. In the absence of textual transcriptions, the hidden units can be processed as pseudo-texts to induce unsupervised meaning embeddings.
We mainly investigate two approaches to train
unsupervised (pseudo-)text embedding models,
namely, SimCSE (Gao et al.,2021) and TSDAE
(Wang et al.,2021). If these two types of models
are trained with hidden units, they are referred to as
Hu-SimCSE and Hu-TSDAE respectively, in order
to distinguish them from the text-based models.
SimCSE
The unsupervised SimCSE (Gao et al., 2021) is a contrastive learning framework for textual sentence embeddings. It takes a sentence as input and uses the same sentence as the target, with dropout as the only noise. As pretrained transformers such as BERT and RoBERTa apply a dropout rate of 10%, the same sentence yields slightly different hidden states across multiple passes, and the resulting representations can be treated as positive pairs in contrastive learning. We trained SimCSE models to induce sentence meaning from text transcripts before transferring the knowledge to a speech model. For modeling hidden units, we first pretrained a BERT model on hidden units converted from the whole speech corpus. Then Hu-SimCSE was initialized with this pretrained hidden-unit BERT for training.
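As a rough illustration of this dropout-based contrastive objective, the sketch below assumes a Hugging Face-style encoder that exposes a pooler_output (e.g., BertModel) and uses in-batch negatives; it is not the authors' training code, and the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature: float = 0.05):
    """Unsupervised SimCSE objective: two forward passes of the same batch
    differ only by dropout, and each sentence's second view is its positive."""
    # The encoder must be in train mode so that dropout is active.
    z1 = encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    z2 = encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```

For Hu-SimCSE, the same objective would simply be applied to a BERT encoder whose vocabulary consists of hidden-unit tokens rather than word pieces.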
TSDAE
Transformer-based Sequential Denois-
ing AutoEncoder (TSDAE) is a denoising encoder-
decoder model that encodes a corrupted text se-
quence into a dense vector and decodes the origi-
nal text sequence. We trained text-based TSDAE
models following as closely as possible the set-
tings specified by Wang et al. (2021). However,
slightly different hyperparameters were adopted
for Hu-TSDAE. In the original TSDAE, tokens in the input sequence are randomly deleted with a ratio of 0.6. We found that deleting tokens in the input hidden units significantly hurt performance. Instead, using the same uncorrupted sequence of hidden units as both input and target achieved much better performance in our hyperparameter tuning experiments (see Appendix D.1).
Language modeling on discrete units
Both SimCSE and TSDAE models were initialized with pretrained transformer checkpoints. In addition to publicly available text-based pretrained models, we also pretrained hidden-unit-based transformers. Given a corpus of hidden units converted from raw speech, transformer-based language models were pretrained to learn the statistical regularities in sequences of hidden units. We adopted the same model architecture as BERT (Devlin et al., 2019) (bert-base-uncased) and used the masked language modeling task with a masking rate of 15%. However, the next sentence prediction task was discarded, because it was not found to significantly affect model performance (Liu et al., 2019; Lan et al., 2019).
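A hedged sketch of such hidden-unit masked language modeling with Hugging Face components is given below; the file paths, tokenizer directory and training arguments are assumptions for illustration, and the tokenizer is assumed to define the usual special tokens (e.g., [MASK], [PAD]).

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
                          PreTrainedTokenizerFast, Trainer, TrainingArguments)

# Hidden-unit "sentences", one per line, e.g. "17 3 41 7 ..." (assumed file and format).
dataset = load_dataset("text", data_files={"train": "hidden_units.txt"})

# A tokenizer previously trained on hidden-unit sequences (e.g. a sentencepiece model
# wrapped as a fast tokenizer) is assumed to exist at this path.
tokenizer = PreTrainedTokenizerFast.from_pretrained("hidden_unit_tokenizer")

tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Same architecture as bert-base-uncased, but with a hidden-unit vocabulary.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hu-bert-mlm"),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
)
trainer.train()
```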
Knowledge distillation
We transferred the
knowledge from a pretrained textual sentence em-
bedding model (SimCSE or TSDAE) into a speech
embedding model through teacher-student training
(Duquenne et al.,2021). Here the teacher model
was the pretrained text embedding model, whereas the
student model was the pretrained speech model,
HuBERT (Hsu et al.,2021).
We used contrastive learning for training S-
HuBERT (Sun et al.,2020;Wu et al.,2021;Ye
et al., 2022). Given a speech embedding $z_i$ and its corresponding (pseudo-)text embedding $\tilde{z}_i^{+}$, with in-batch negative samples, the InfoNCE loss (Oord et al., 2018) is computed as

$$ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{e^{\mathrm{sim}(z_i, \tilde{z}_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(z_i, \tilde{z}_j^{+})/\tau}} \tag{1} $$

where $\tau$ is the temperature parameter and $\mathrm{sim}(\cdot)$ is the cosine similarity function $\mathrm{sim}(z_1, z_2) = z_1^{\top} z_2 / (\lVert z_1 \rVert \cdot \lVert z_2 \rVert)$. $\tau$ was set to 0.05 in all experiments. In order to keep a large number of negative samples, we maintained a dynamic memory bank of negative samples (He et al., 2020). In each iteration, the textual representations of the last mini-batch are enqueued into the memory bank, whereas the oldest textual representations in the bank are dequeued. The text model is frozen throughout training. A comparison of InfoNCE and MSE loss is available in Table 14 in Appendix D.3.
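The following is a minimal PyTorch sketch of Equation (1) combined with a first-in-first-out memory bank of teacher embeddings; the queue size, embedding dimension and class name are assumptions, not the exact training implementation.

```python
import torch
import torch.nn.functional as F

class DistillInfoNCE(torch.nn.Module):
    """InfoNCE loss (Eq. 1) between speech embeddings (student) and frozen
    (pseudo-)text embeddings (teacher), with a FIFO queue of extra negatives."""

    def __init__(self, dim: int = 768, queue_size: int = 4096, temperature: float = 0.05):
        super().__init__()
        self.temperature = temperature
        # Queue initialized with random unit vectors (MoCo-style); it fills with
        # real teacher embeddings as training proceeds.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=-1))

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        z = F.normalize(speech_emb, dim=-1)          # student embeddings z_i
        t = F.normalize(text_emb, dim=-1).detach()   # teacher embeddings (frozen)
        # Cosine similarities against in-batch targets plus queued negatives.
        logits = z @ torch.cat([t, self.queue]).T / self.temperature
        labels = torch.arange(z.size(0), device=z.device)  # matching text is the positive
        loss = F.cross_entropy(logits, labels)
        # Enqueue the current teacher embeddings, dequeue the oldest ones.
        self.queue = torch.cat([t, self.queue])[: self.queue.size(0)].detach()
        return loss
```

In this sketch, `speech_emb` would come from the pooled HuBERT student and `text_emb` from the frozen Hu-SimCSE or Hu-TSDAE teacher; only the student receives gradients.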
3.3 WavEmbed
WavEmbed is a sequential autoencoder (Vincent et al., 2010; Hill et al., 2016; Wang et al., 2021) that encodes a speech signal $x$ into a fixed-dimensional vector $z$ and decodes a discrete target sequence using only the encoded vector. The vector $z$ is used as the semantic representation. The decoded discrete representations can be actual texts $y$ or sequences of hidden acoustic units $\hat{y}$. The proposed method is inspired by TSDAE (Wang et al., 2021), which learns effective unsupervised sentence embeddings through a denoising encoder-decoder model that encodes a corrupted text sequence into a dense vector and decodes the uncorrupted one. WavEmbed generalizes the original TSDAE to acoustic signals and can learn semantic representations of speech by reconstructing not only texts but also hidden acoustic units discovered without supervision.
Yet WavEmbed differs from TSDAE in some as-
pects. TSDAE’s encoder and decoder components
are all text-based, whereas WavEmbed utilizes a
speech encoder. TSDAE relies on the denoising
reconstruction as a proxy task, in which the model
is trained to recover the original sentence from the
embedding of the corrupted sentence (word dele-
tion with a ratio of 0.6). In contrast, WavEmbed reconstructs a discrete sequence from the embedding of the corresponding spoken sentence, and no corruption except standard dropout is applied to the speech signal. In addition, WavEmbed uses self-attention pooling rather than TSDAE's average pooling to pool the encoder hidden states, as self-attention pooling is more effective than mean or max pooling for sentence-level speech embeddings (see Khurana et al., 2022, and Table 14 in Appendix D.3).
The encoder $f_{enc}$ consists of two parts: a pretrained speech transformer $f_S$ for speech feature extraction and a self-attention pooling layer. Let $H \in \mathbb{R}^{T \times d} = [h_1, h_2, \dots, h_T]$ be the hidden states of the speech transformer $f_S$ given a speech signal $x$. The self-attention pooling operation (Safari et al., 2020) can be computed as:

$$ H = f_S(x) \tag{2} $$
$$ z = \mathrm{Softmax}(W H^{\top}) H \tag{3} $$
where $W \in \mathbb{R}^{d}$ is a learnable parameter during training. Given a semantic representation $z$ of the speech signal $x$, the autoregressive decoder $f_{dec}$ predicts the hidden units $\hat{y}$ that correspond to the content of the speech signal $x$:

$$ \hat{y} = f_{dec}(z) \tag{4} $$
The encoder-decoder model is trained with the standard negative log-likelihood loss:

$$ \mathcal{L} = -\sum_{l=1}^{L} \log P(\hat{y}_l \mid z, \hat{y}_{l-1}, \dots, \hat{y}_1) = -\sum_{l=1}^{L} \log P(\hat{y}_l \mid f_{enc}(x), \hat{y}_{l-1}, \dots, \hat{y}_1) \tag{5} $$
WavEmbed is trained to predict discrete acoustic units $\hat{y}$ based on the speech signal $x$. However, when textual transcripts for the speech signals are available, the prediction targets can also be replaced with textual sequences $y$ to enhance the learning of semantic content. Once the model is trained, the decoder is discarded, leaving only the speech encoder for extracting semantic embeddings.
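To make the pooling step concrete, here is a small PyTorch sketch of the self-attention pooling in Equations (2)-(3); the class name and dimensions are illustrative assumptions.

```python
import torch

class SelfAttentionPooling(torch.nn.Module):
    """Pool frame-level hidden states H (batch x T x d) into a single vector z
    using a learned weight vector W, as in Eq. (2)-(3)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(dim) / dim ** 0.5)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: hidden states of the speech transformer f_S.
        attn = torch.softmax(H @ self.W, dim=-1)    # (batch, T) frame weights
        return torch.einsum("bt,btd->bd", attn, H)  # weighted sum over frames -> z
```

In WavEmbed, the pooled vector $z$ then conditions the autoregressive decoder, which is trained with the negative log-likelihood objective in Equation (5); at inference the decoder is dropped and $z$ serves as the spoken sentence embedding.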