Bootstrapping meaning through listening: Unsupervised learning of
spoken sentence embeddings
Jian Zhu¹,², Zuoyu Tian³, Yadong Liu², Cong Zhang⁴, Chia-wen Lo⁵
¹University of Michigan, Ann Arbor  ²University of British Columbia
³Indiana University Bloomington  ⁴Newcastle University
⁵Max Planck Institute for Human Cognitive and Brain Sciences
¹lingjzhu@umich.edu, ³zuoytian@iu.edu
Abstract
Inducing semantic representations directly
from speech signals is a highly challeng-
ing task but has many useful applications in
speech mining and spoken language under-
standing. This study tackles the unsupervised
learning of semantic representations for spo-
ken utterances. Through converting speech
signals into hidden units generated from acous-
tic unit discovery, we propose WavEmbed, a
multimodal sequential autoencoder that pre-
dicts hidden units from a dense representa-
tion of speech. Secondly, we also propose S-
HuBERT to induce meaning through knowl-
edge distillation, in which a sentence embed-
ding model is first trained on hidden units
and passes its knowledge to a speech encoder
through contrastive learning. The best per-
forming model achieves a moderate correlation (0.5–0.6) with human judgments, without
relying on any labels or transcriptions. Further-
more, these models can also be easily extended
to leverage textual transcriptions of speech to
learn much better speech embeddings that are
strongly correlated with human annotations.
Our proposed methods are applicable to the
development of purely data-driven systems for
speech mining, indexing and search.
1 Introduction
In Spoken Language Understanding (SLU), a goal
is to understand the semantic content of spoken
utterances. Traditionally, research in speech processing has focused on tasks that process the low-level sensory information in speech, such as automatic speech recognition (ASR), under the assumption that language understanding can be handled by NLP modules after speech is transcribed (Wang et al., 2005; De Mori et al., 2008; Serdyuk et al., 2018). Yet speech-based semantic representations allow us to bypass texts in some scenarios, not only simplifying the pipeline but also benefiting domains without much transcribed data and languages without writing systems.
For speech processing, the spoken term detec-
tion tasks such as keyword detection (e.g., Mamou
et al.,2007;Miller et al.,2007;Can and Saraclar,
2011;Wang et al.,2018) and query-by-example
search (e.g., Hazen et al.,2009;Parada et al.,2009;
Chen et al.,2015) focus on the exact matching of
audio terms in speech databases. Yet speech-to-speech search enabled by spoken sentence embeddings can further expand our capacity to search speech by meaning rather than only by form. This marks a significant advancement in machines' ability to perform speech mining, voice search and indexing, and spoken information retrieval (Duquenne et al., 2021).
While learning textual sentence similarity is a
classic task in NLP (e.g., Agirre et al.,2012,2015,
2016;Cer et al.,2017), the task is still relatively
unexplored in speech research. The main challenge
in learning spoken sentence embeddings lies in the
lack of labeled data for supervised learning. Given
the costs associated with creating semantic ratings,
it is important to explore unsupervised methods
to induce semantic representations directly from
speech signals.
In this study, we present two approaches to
tackle the challenge of inducing semantic repre-
sentations directly from speech signals without any
semantic labeling. The first model, Waveform Em-
bedding Transformer (WavEmbed) (Figure 1), is
a multimodal sequential autoencoder that encodes
a speech signal into a bottleneck vector and re-
constructs a sequence of ‘hidden units’, which are
generated using unsupervised acoustic unit dis-
covery. The second model, Sentence HuBERT
(S-HuBERT), learns the semantic representation
through aligning with a frozen unsupervised text
embedding model, which is trained with the hidden
units (Figure 2). We make the following contributions:
Figure 1: The architecture of WavEmbed. WavEmbed first projects a speech signal into a fixed-dimensional vector
representation, and then decodes it back to discrete acoustic units, which are generated through clustering on the
hidden states from the sixth layer of the (frozen) pretrained HuBERT model. The learned fixed-dimensional vector
encodes semantic information in the latent space. No texts are required in this training loop. However, if text
transcripts are available, the decoder targets can also be textual sequences.
Figure 2: Illustration of S-HuBERT. An unsupervised
text embedding model is first trained on either hidden
units or textual transcripts. Then it is used as a teacher
model to transfer semantic knowledge to a speech en-
coder through contrastive model distillation.
• We propose simple yet effective unsupervised methods to learn spoken sentence representations. Our best performing unsupervised model achieves moderate Spearman's rank correlations (0.5–0.6) with human judgements without relying on any labels or text transcriptions.
• Our proposed methods can be easily extended to speech-text pairs to enhance performance. With text transcriptions, the performance can be further increased to 0.7–0.8 in terms of Spearman's correlation. We made extensive comparisons and analyses of model performance under different conditions.
• We have also created a speech dataset for evaluating spoken sentence similarity, rated by multiple human raters and encompassing various speech accents, to measure the robustness of models.
Our code, data and pretrained checkpoints necessary for replicating the experiments are available at https://github.com/lingjzhu/spoken_sent_embedding.
2 Background
Self-supervised speech modeling
Most speech
technologies including ASR and text-to-speech syn-
thesis (TTS) nowadays heavily rely on the availabil-
ity of text transcripts. Yet such textual resources
can sometimes be hard to collect for many lan-
guages, some of which might not have writing
systems. Many efforts have since been made to
explore effective methods to learn speech repre-
sentations directly from speech signals, such as
the ZeroSpeech Workshop (Versteegh et al.,2015;
Dunbar et al.,2017,2019,2020,2021).
Recently, large-scale self-supervised models in-
cluding CPC (Oord et al.,2018), Wav2Vec (Schnei-
der et al.,2019), Wav2Vec2 (Baevski et al.,2020),
HuBERT (Hsu et al.,2021) and WavLM (Chen
et al.,2021) have learned effective speech repre-
sentations that can benefit a wide range of down-
stream speech tasks (Yang et al.,2021). In par-
ticular, Hsu et al. (2021) proposed using a clustering algorithm to cluster the hidden states of HuBERT into hidden units, which were then used as prediction targets for masked frames during training. These clusters were shown to encode rich phonemic information (Hsu et al., 2021; Baevski et al., 2021). It was later found that discretizing speech into 'hidden units' allows the application of NLP algorithms to process speech via the proxy of these discrete units, without the need for actual textual transcriptions ('textless NLP') (Lakhotia et al., 2021; Nguyen et al., 2022).
This discovery has greatly benefited a variety of
tasks, some of which were traditionally not per-
formed with speech, including unsupervised ASR
(Baevski et al.,2021), spoken language modeling
(Ao et al.,2022;Hsu et al.,2021;Wu et al.,2022;
Nguyen et al.,2022), speech resynthesis (Polyak
et al.,2021), spoken language generation (Lakhotia
et al.,2021;Kharitonov et al.,2022b), speech-to-
speech translation (Lee et al.,2021;Popuri et al.,
2022;Lee et al.,2022;Wu et al.,2022), and spoken
named entity recognition (Wu et al.,2022). Follow-
ing these prior studies, our work also falls in the
domain of ‘textless NLP’.
Unsupervised sentence embeddings
Learning semantic sentence embeddings has been extensively studied in the NLP community, with approaches such as Skip-Thought vectors (Kiros et al., 2015), InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and SBERT (Reimers and Gurevych,
2019). Recently, unsupervised sentence embed-
dings have considerably narrowed the performance
gap between unsupervised and supervised methods.
Contrastive learning has been utilized to learn a vec-
tor space in which semantically similar sentences
are close to each other, such as DeCLUTR (Giorgi
et al.,2021), SimCSE (Gao et al.,2021), TransEn-
coder (Liu et al.,2021) and DiffCSE (Chuang et al.,
2022). Another approach relies on autoencoders to
compress a sentence into a latent vector represen-
tation and then reconstruct the original sentence,
such as VGVAE (Chen et al.,2019) and TSDAE
(Wang et al.,2021).
Most unsupervised methods are based on tex-
tual sentences. In speech, current studies tend to
center on acoustic word embeddings (e.g., Kamper
et al.,2016;Settle and Livescu,2016;Settle et al.,
2017;Holzenberger et al.,2018;Kamper,2019).
Despite this progress, learning sentence-level embeddings for speech still remains under-explored. In the SUPERB benchmark for evaluating speech representations (Yang et al., 2021), spoken sentence similarity ranking is not yet listed as a downstream task. Recent work has shown that spoken sentence semantic similarities can be learned via visually grounded speech models (Merkx et al., 2021). Multilingual spoken sentence embeddings can also be learned by using supervised multilingual text models as teacher models (Duquenne et al., 2021; Khurana et al., 2022). These methods more or less rely on labeled data such as speech-image pairs or multilingual sentence pairs. In contrast, we propose unsupervised methods to induce semantic embeddings from speech signals only, and our methods can also utilize textual transcriptions to improve performance when they are available.
3 Method
Task formulation
The current task is to encode spoken utterances into low-dimensional dense vectors such that semantically similar utterances are close to each other in the learned latent space. Given a speech signal $x \in \mathbb{R}^{1 \times N} = [x_1, x_2, \dots, x_N]$, our goal is to learn a neural network function $f_{enc}$ that converts $x$ to a fixed-dimensional vector $z = f_{enc}(x) \in \mathbb{R}^{d}$, such that $z$ encodes the semantic content of the original signal $x$. For a semantically similar pair $\{z, z^{+}\}$ and a semantically dissimilar pair $\{z, z^{-}\}$ (as determined by human raters), it is expected that $\mathrm{sim}(z, z^{+}) > \mathrm{sim}(z, z^{-})$, where $\mathrm{sim}(\cdot)$ is a similarity scoring function.
It is further assumed that some form of transcription of the original speech signal exists. Usually, a transcription of $x$ takes the form of a textual sequence $y \in \mathbb{R}^{1 \times M} = [y_1, y_2, \dots, y_M]$, with $N > M$. Such data are sometimes available, as most speech datasets for ASR and TTS are organized as pairs of speech and text. However, in most scenarios textual transcriptions are not available or are too costly to create. In these cases, the transcriptions can take the form of pseudo-units $\hat{y} \in \mathbb{R}^{1 \times L} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_L]$, with $N > L$, which can be generated by an unsupervised system for acoustic unit discovery. During training, these transcripts are used as the targets for the proxy tasks. At inference time, however, the pretrained model directly projects speech into semantic embeddings.
3.1 Discretizing speech signals
Acoustic unit discovery refers to the task of seg-
menting speech signals into discrete word-like or
phone-like units (e.g., Lee and Glass,2012;Lee
et al.,2015;Ondel et al.,2016;Kamper,2019;van
Niekerk et al.,2020). Annotating speech signals
can sometimes be prohibitively costly for many
languages and application domains. Unsupervised
discovery of acoustic units can be used as a proxy
of transcriptions to train speech systems, if the
discovered acoustic units are consistent representa-
tions of speech. In our approach, acoustic units are
treated as 'pseudo-texts' to bootstrap the learning
of semantic representations.
We used a pretrained speech transformer, HuBERT, to discretize speech signals into 'hidden units', as proposed in Baevski et al. (2021) and Lakhotia et al. (2021). After passing speech signals into HuBERT, the hidden states of the sixth layer were extracted and a k-means clustering algorithm was applied to quantize them into discrete clusters. The sequence of cluster indexes, after deduplication by merging consecutive identical indexes, constitutes the hidden units representing the original speech (see Figure 1). The discrete hidden units remove certain paralinguistic and non-linguistic variations such as speaker voice traits and background noise, so they can be considered a normalized representation of the speech content (though many phonetic variations are still present) (Lee et al., 2021).
We used textless-lib (Kharitonov et al., 2022a) to convert speech signals into discrete hidden units. We selected hubert-base-ls960 as the base speech encoder and set the number of clusters to 50, 100 and 200. After speech was discretized into sequences of hidden units, sentence-piece tokenizers (Kudo and Richardson, 2018) were trained on them to shorten the sequence length (see Appendix A). There is evidence that re-tokenizing hidden units is generally beneficial for language modeling and downstream tasks (Ren et al., 2022; Wu et al., 2022).
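For concreteness, below is a minimal sketch of this discretization step using the Hugging Face transformers implementation of HuBERT and scikit-learn k-means; the paper itself uses textless-lib, so the checkpoint, layer index, cluster count and file names here are illustrative assumptions rather than the exact pipeline.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Illustrative stand-in for the textless-lib pipeline used in the paper.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer6_features(wav_path: str) -> torch.Tensor:
    """Return frame-level hidden states from the sixth HuBERT layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[6].squeeze(0)  # (num_frames, 768)

# Fit k-means on frames pooled from the unlabeled corpus (file names are hypothetical).
kmeans = MiniBatchKMeans(n_clusters=100)
corpus = ["utt1.wav", "utt2.wav"]
kmeans.fit(torch.cat([layer6_features(p) for p in corpus]).numpy())

def to_hidden_units(wav_path: str) -> list[int]:
    """Quantize frames to cluster indexes and merge consecutive duplicates."""
    ids = kmeans.predict(layer6_features(wav_path).numpy()).tolist()
    return [u for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]
```

The deduplicated index sequences returned by `to_hidden_units` play the role of the 'pseudo-texts' on which the tokenizers and language models described below are trained.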
3.2 S-HuBERT
The first approach, S-HuBERT, is to transfer the
knowledge of a well-learned text embedding model
to a speech embedding model (Duquenne et al.,
2021;Khurana et al.,2022), in which pretrained
supervised textual embeddings are adopted as the
teacher models and speech models are trained to
align with the text embeddings in the same latent
space.
Here we also extend this approach to the unsupervised learning domain. The proposed method first trains an unsupervised sentence embedding model on transcriptions, and then transfers its knowledge to an acoustic sentence embedding model (S-HuBERT) by leveraging the correspondence between speech and its transcriptions. In the absence of textual transcriptions, the hidden units can be processed as pseudo-texts to induce unsupervised meaning embeddings.
We mainly investigate two approaches to train
unsupervised (pseudo-)text embedding models,
namely, SimCSE (Gao et al.,2021) and TSDAE
(Wang et al.,2021). If these two types of models
are trained with hidden units, they are referred to as
Hu-SimCSE and Hu-TSDAE respectively, in order
to distinguish them from the text-based models.
SimCSE
The unsupervised SimCSE (Gao et al., 2021) is a contrastive learning framework for textual sentence embeddings. It takes a sentence as input and uses the same sentence as the target, with dropout as the only noise. As pretrained transformers such as BERT and RoBERTa apply a dropout rate of 10%, the same sentence yields slightly different hidden states across multiple passes, and the resulting representations can be treated as positive pairs in contrastive learning. We trained SimCSE models to induce sentence meaning from text transcripts before transferring the knowledge to a speech model. For modeling hidden units, we first pretrained a BERT model on hidden units converted from the whole speech corpus. Then Hu-SimCSE was initialized with this pretrained hidden-unit BERT for training.
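As a rough illustration of this dropout-based contrastive objective, the sketch below assumes a Hugging Face-style encoder that exposes a pooler_output (e.g., BertModel) and uses in-batch negatives; it is not the authors' training code, and the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature: float = 0.05):
    """Unsupervised SimCSE objective: two forward passes of the same batch
    differ only by dropout, and each sentence's second view is its positive."""
    # The encoder must be in train mode so that dropout is active.
    z1 = encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    z2 = encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```

For Hu-SimCSE, the same objective would simply be applied to a BERT encoder whose vocabulary consists of hidden-unit tokens rather than word pieces.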
TSDAE
Transformer-based Sequential Denois-
ing AutoEncoder (TSDAE) is a denoising encoder-
decoder model that encodes a corrupted text se-
quence into a dense vector and decodes the origi-
nal text sequence. We trained text-based TSDAE
models following as closely as possible the set-
tings specified by Wang et al. (2021). However,
slightly different hyperparameters were adopted
for Hu-TSDAE. In the original TSDAE, tokens in the input sequence are randomly deleted with a ratio of 0.6. We found that deleting tokens in the input hidden units significantly hurt performance. Instead, using the same uncorrupted sequence of hidden units as both input and target achieved much better performance in our hyperparameter tuning experiments (see Appendix D.1).
Language modeling on discrete units
Both SimCSE and TSDAE models were initialized with pretrained transformer checkpoints. In addition to publicly available text-based pretrained models, we also pretrained hidden-unit-based transformers. Given a corpus of hidden units converted from raw speech, transformer-based language models were pretrained to learn the statistical regularities in sequences of hidden units. We adopted the same model architecture as BERT (Devlin et al., 2019) (bert-base-uncased) and used the masked language modeling task with a masking rate of 15%. However, the next sentence prediction task was discarded, because it was not found to significantly affect model performance (Liu et al., 2019; Lan et al., 2019).
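A hedged sketch of such hidden-unit masked language modeling with Hugging Face components is given below; the file paths, tokenizer directory and training arguments are assumptions for illustration, and the tokenizer is assumed to define the usual special tokens (e.g., [MASK], [PAD]).

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
                          PreTrainedTokenizerFast, Trainer, TrainingArguments)

# Hidden-unit "sentences", one per line, e.g. "17 3 41 7 ..." (assumed file and format).
dataset = load_dataset("text", data_files={"train": "hidden_units.txt"})

# A tokenizer previously trained on hidden-unit sequences (e.g. a sentencepiece model
# wrapped as a fast tokenizer) is assumed to exist at this path.
tokenizer = PreTrainedTokenizerFast.from_pretrained("hidden_unit_tokenizer")

tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Same architecture as bert-base-uncased, but with a hidden-unit vocabulary.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hu-bert-mlm"),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
)
trainer.train()
```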
Knowledge distillation
We transferred the
knowledge from a pretrained textual sentence em-
bedding model (SimCSE or TSDAE) into a speech
embedding model through teacher-student training
(Duquenne et al.,2021). Here the teacher model
was the pretrained text embedding model, whereas the
student model was the pretrained speech model,
HuBERT (Hsu et al.,2021).
We used contrastive learning for training S-
HuBERT (Sun et al.,2020;Wu et al.,2021;Ye
et al., 2022). Given a speech embedding $z_i$ and its corresponding (pseudo-)text embedding $\tilde{z}_i^{+}$, with in-batch negative samples, the InfoNCE loss (Oord et al., 2018) is computed as

$$ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{e^{\mathrm{sim}(z_i, \tilde{z}_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(z_i, \tilde{z}_j^{+})/\tau}} \tag{1} $$

where $\tau$ is the temperature parameter and $\mathrm{sim}(\cdot)$ is the cosine similarity function $\mathrm{sim}(z_1, z_2) = z_1^{\top} z_2 / (\lVert z_1 \rVert \cdot \lVert z_2 \rVert)$. $\tau$ was set to 0.05 in all experiments. In order to keep a large number of negative samples, we maintained a dynamic memory bank of negative samples (He et al., 2020). In each iteration, the textual representations of the last mini-batch are enqueued into the memory bank, whereas the oldest textual representations in the bank are dequeued. The text model is frozen throughout training. A comparison of InfoNCE and MSE loss is available in Table 14 in Appendix D.3.
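The following is a minimal PyTorch sketch of Equation (1) combined with a first-in-first-out memory bank of teacher embeddings; the queue size, embedding dimension and class name are assumptions, not the exact training implementation.

```python
import torch
import torch.nn.functional as F

class DistillInfoNCE(torch.nn.Module):
    """InfoNCE loss (Eq. 1) between speech embeddings (student) and frozen
    (pseudo-)text embeddings (teacher), with a FIFO queue of extra negatives."""

    def __init__(self, dim: int = 768, queue_size: int = 4096, temperature: float = 0.05):
        super().__init__()
        self.temperature = temperature
        # Queue initialized with random unit vectors (MoCo-style); it fills with
        # real teacher embeddings as training proceeds.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=-1))

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        z = F.normalize(speech_emb, dim=-1)          # student embeddings z_i
        t = F.normalize(text_emb, dim=-1).detach()   # teacher embeddings (frozen)
        # Cosine similarities against in-batch targets plus queued negatives.
        logits = z @ torch.cat([t, self.queue]).T / self.temperature
        labels = torch.arange(z.size(0), device=z.device)  # matching text is the positive
        loss = F.cross_entropy(logits, labels)
        # Enqueue the current teacher embeddings, dequeue the oldest ones.
        self.queue = torch.cat([t, self.queue])[: self.queue.size(0)].detach()
        return loss
```

In this sketch, `speech_emb` would come from the pooled HuBERT student and `text_emb` from the frozen Hu-SimCSE or Hu-TSDAE teacher; only the student receives gradients.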
3.3 WavEmbed
WavEmbed is a sequential autoencoder (Vincent et al., 2010; Hill et al., 2016; Wang et al., 2021) that encodes a speech signal $x$ into a fixed-dimensional vector $z$ and decodes a discrete target sequence using only the encoded vector. The vector $z$ is used as the semantic representation. The decoded discrete representations can be actual texts $y$ or sequences of hidden acoustic units $\hat{y}$. The proposed method is inspired by TSDAE (Wang et al., 2021), which learns effective unsupervised sentence embeddings through a denoising encoder-decoder model that encodes a corrupted text sequence into a dense vector and decodes the uncorrupted one. WavEmbed generalizes the original TSDAE to acoustic signals and can learn semantic representations of speech by reconstructing not only texts but also hidden acoustic units discovered without supervision.
Yet WavEmbed differs from TSDAE in some as-
pects. TSDAE’s encoder and decoder components
are all text-based, whereas WavEmbed utilizes a
speech encoder. TSDAE relies on the denoising
reconstruction as a proxy task, in which the model
is trained to recover the original sentence from the
embedding of the corrupted sentence (word dele-
tion with a ratio of 0.6). In contrast, WavEmbed reconstructs a discrete sequence from the embedding of the corresponding spoken sentence, and no corruption except standard dropout is applied to the speech signal. In addition, WavEmbed uses self-attention pooling rather than TSDAE's average pooling to pool the encoder hidden states, as self-attention pooling is more effective than mean or max pooling for sentence-level speech embeddings (see Khurana et al., 2022, and Table 14 in Appendix D.3).
The encoder $f_{enc}$ consists of two parts: a pretrained speech transformer $f_S$ for speech feature extraction and a self-attention pooling layer. Let $H \in \mathbb{R}^{T \times d} = [h_1, h_2, \dots, h_T]$ be the hidden states of the speech transformer $f_S$ given a speech signal $x$. The self-attention pooling operation (Safari et al., 2020) can be computed as:

$$ H = f_S(x) \tag{2} $$
$$ z = \mathrm{Softmax}(W H^{\top}) H \tag{3} $$
where $W \in \mathbb{R}^{d}$ is a learnable parameter during training. Given a semantic representation $z$ of the speech signal $x$, the autoregressive decoder $f_{dec}$ predicts the hidden units $\hat{y}$ that correspond to the content of the speech signal $x$:

$$ \hat{y} = f_{dec}(z) \tag{4} $$
The encoder-decoder model is trained with the standard negative log-likelihood loss:

$$ \mathcal{L} = -\sum_{l=1}^{L} \log P(\hat{y}_l \mid z, \hat{y}_{l-1}, \dots, \hat{y}_1) = -\sum_{l=1}^{L} \log P(\hat{y}_l \mid f_{enc}(x), \hat{y}_{l-1}, \dots, \hat{y}_1) \tag{5} $$
WavEmbed is trained to predict discrete acoustic units $\hat{y}$ based on the speech signal $x$. However, when textual transcripts for the speech signals are available, the prediction targets can also be replaced with textual sequences $y$ to enhance the learning of semantic content. Once the model is trained, the decoder is discarded, leaving only the speech encoder for extracting semantic embeddings.
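To make the pooling step concrete, here is a small PyTorch sketch of the self-attention pooling in Equations (2)-(3); the class name and dimensions are illustrative assumptions.

```python
import torch

class SelfAttentionPooling(torch.nn.Module):
    """Pool frame-level hidden states H (batch x T x d) into a single vector z
    using a learned weight vector W, as in Eq. (2)-(3)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(dim) / dim ** 0.5)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: hidden states of the speech transformer f_S.
        attn = torch.softmax(H @ self.W, dim=-1)    # (batch, T) frame weights
        return torch.einsum("bt,btd->bd", attn, H)  # weighted sum over frames -> z
```

In WavEmbed, the pooled vector $z$ then conditions the autoregressive decoder, which is trained with the negative log-likelihood objective in Equation (5); at inference the decoder is dropped and $z$ serves as the spoken sentence embedding.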