
OPENING THE BLACK BOX OF WAV2VEC FEATURE ENCODER
Kwanghee Choi1∗, Eun Jung Yeo2∗
Department of Computer Science and Engineering, Sogang University, Republic of Korea1
Department of Linguistics, Seoul National University, Republic of Korea2
∗Equal contributors.
ABSTRACT
Self-supervised models, namely, wav2vec and its variants,
have shown promising results in various downstream tasks in
the speech domain. However, their inner workings are poorly
understood, calling for in-depth analyses of what the model learns. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed synthesized audio signals, which are summations of simple sine waves.
Through extensive experiments, we conclude that various kinds of information are embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to that of spectrograms, but with a fundamental difference: latent representations construct a metric space, so that closer representations imply acoustic similarity.
Index Terms—Self-supervised Learning, Convolutional
Feature Encoder, Acoustic Modeling, Phonetic Analysis
1. INTRODUCTION
Large-scale self-supervised models, namely, wav2vec [1] and
its variants [2, 3], have succeeded in a wide range of down-
stream tasks, such as automatic speech recognition [2] and
keyword spotting [4]. Furthermore, recent advances demon-
strate that the model can handle general tasks such as emo-
tion recognition, environment classification, and music tag-
ging, even with models pretrained with unlabeled speech data
[4, 5]. Concretely, self-supervised learning in wav2vec is de-
signed as partially hiding the sequential representations and
identifying the original representation based on its surround-
ings in a contrastive manner [1, 2]. Hence, one cannot help
but wonder about the acoustic capabilities of self-supervised
learning, as it does not explicitly guide the model to capture
the necessary information for speech analysis.
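To make this objective concrete, the following minimal sketch (Python/NumPy) illustrates a masked contrastive loss of this kind; the cosine similarity, temperature, and within-utterance negative sampling are illustrative assumptions rather than the exact wav2vec recipe.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def masked_contrastive_loss(context, targets, masked_steps,
                                num_negatives=10, temperature=0.1):
        # context: (T, D) outputs of the contextual network (inputs masked at masked_steps)
        # targets: (T, D) latent targets produced by the feature encoder
        T = targets.shape[0]
        losses = []
        for t in masked_steps:
            # distractors are drawn from other time steps of the same utterance
            negatives = np.random.choice([i for i in range(T) if i != t],
                                         size=num_negatives, replace=False)
            logits = np.array([cosine(context[t], targets[j])
                               for j in (t, *negatives)]) / temperature
            log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
            losses.append(-log_probs[0])  # the true latent sits at index 0
        return float(np.mean(losses))

In wav2vec 2.0, the targets are additionally quantized and the objective includes a codebook diversity term [2]; those details are omitted here for brevity.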
Despite subtle differences, wav2vec-like models usually consist of two components: the feature encoder based on consecutive 1D convolutional layers and the transformer-based contextual network [1]. The convolutional feature encoder is commonly understood to embed acoustic features, while the contextual network handles global, abstract information [2]. Based on this understanding, state-of-the-art mod-
els choose between passing the raw waveform through the 1D
convolutional feature encoder or directly inputting the spec-
trogram [4, 5]. However, the inner workings remain poorly understood, calling for in-depth analyses of what the convolutional network actually encodes.
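For concreteness, a minimal sketch of this two-component structure is given below (PyTorch); the kernel widths and strides loosely follow the wav2vec 2.0 feature encoder, while normalization, projection, masking, and quantization modules are omitted, so it should be read as an illustration rather than the exact model.

    import torch
    import torch.nn as nn

    class FeatureEncoder(nn.Module):
        """Stack of 1D convolutions mapping a raw waveform to latent frames."""
        def __init__(self, dim=512):
            super().__init__()
            # (kernel, stride) pairs; overall stride is 320 samples, i.e., 20 ms at 16 kHz
            specs = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
            layers, in_ch = [], 1
            for k, s in specs:
                layers += [nn.Conv1d(in_ch, dim, k, stride=s), nn.GELU()]
                in_ch = dim
            self.conv = nn.Sequential(*layers)

        def forward(self, wav):                 # wav: (batch, samples)
            z = self.conv(wav.unsqueeze(1))     # (batch, dim, frames)
            return z.transpose(1, 2)            # (batch, frames, dim)

    feature_encoder = FeatureEncoder()
    contextual_net = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=12)

    z = feature_encoder(torch.randn(1, 16000))  # latent representations analyzed in this paper
    c = contextual_net(z)                       # contextual representations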
In this paper, we are motivated to explore whether the convolutional feature encoder can substitute for the spectrogram, the basic tool in audio spectral analysis, which shows frequency-wise amplitude information with sufficient temporal detail. Hence, we start with the simplest form of input: the weighted sum of sine waves of different frequencies. By taking a reductionist approach, we synthesize inputs block by block instead of using recorded speech signals. Our purpose is to provide a fundamental understanding of how and why the encoder works so well.
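A minimal sketch of this input construction is given below (Python/NumPy); the specific frequencies, amplitudes, and duration are illustrative placeholders rather than the exact experimental settings.

    import numpy as np

    def synthesize(freqs_hz, amps, duration_s=1.0, sr=16000):
        """Weighted sum of sine waves, peak-normalized to avoid clipping."""
        t = np.arange(int(duration_s * sr)) / sr
        signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs_hz, amps))
        return signal / np.max(np.abs(signal))

    # e.g., an F0-like 120 Hz component plus two formant-like components
    wav = synthesize(freqs_hz=[120, 800, 1200], amps=[1.0, 0.5, 0.25])
    # wav can then be fed to the convolutional feature encoder to obtain latent frames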
We summarize the rest of the paper as follows. §2 covers
related works that analyze the inner workings of wav2vec-like
models by leveraging existing datasets. Then, we describe our
novel analysis method in §3.1. §3.2 explores the temporal
detail encoded in the representations, focusing on whether the representations are consistent across different time steps. §3.3
further investigates the representation space with respect to
fundamental frequencies (F0s) and biases, comparing with
perceptual scales such as the Mel or the Bark scale. §3.4 dives into the details of formants, aided by linguistic analyses, and compares different wav2vec variants. Finally, §3.5 covers
how amplitude differences impact latent representations. We
conclude our paper with in-depth discussions on comparing
spectrogram and feature encoder representations in §4.
2. RELATED WORKS
In the past decades, using learnable convolutional layers to handle raw audio signals has been thoroughly explored, where the prevalent idea was to mimic or enhance the spectrogram. For instance, [6] observed the frequency responses of a few convolutional filters and concluded that convolutional neural networks (CNNs) behave as filter matchers. The
follow-up paper showed that the early layers model the spectral
envelope [7]. In [8], the authors observed the activations di-
rectly, looking for correlations with the acoustic features such