OPENING THE BLACK BOX OF WAV2VEC FEATURE ENCODER
Kwanghee Choi1, Eun Jung Yeo2
Department of Computer Science and Engineering, Sogang University, Republic of Korea1
Department of Linguistics, Seoul National University, Republic of Korea2
Equal contributors.
ABSTRACT
Self-supervised models, namely, wav2vec and its variants, have shown promising results in various downstream tasks in the speech domain. However, their inner workings are poorly understood, calling for in-depth analyses of what the model learns. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed synthesized audio signals, which are summations of simple sine waves. Through extensive experiments, we conclude that various kinds of information are embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Furthermore, the information incorporated inside the latent representations is analogous to spectrograms but with a fundamental difference: latent representations construct a metric space in which closer representations imply acoustic similarity.
Index Terms— Self-supervised Learning, Convolutional Feature Encoder, Acoustic Modeling, Phonetic Analysis
1. INTRODUCTION
Large-scale self-supervised models, namely, wav2vec [1] and its variants [2, 3], have succeeded in a wide range of downstream tasks, such as automatic speech recognition [2] and keyword spotting [4]. Furthermore, recent advances demonstrate that such models can handle general tasks such as emotion recognition, environment classification, and music tagging, even when pretrained with unlabeled speech data [4, 5]. Concretely, self-supervised learning in wav2vec is designed as partially hiding the sequential representations and identifying the original representation based on its surroundings in a contrastive manner [1, 2]. Hence, one cannot help but wonder about the acoustic capabilities of self-supervised learning, as it does not explicitly guide the model to capture the information necessary for speech analysis.
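To make the objective concrete, the sketch below implements a masked contrastive (InfoNCE-style) loss of the kind wav2vec 2.0 employs: context-network outputs at masked positions must identify the true latent among sampled distractors. It is a minimal illustration, not the original implementation; the tensor shapes, the negative-sampling scheme, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, latents, mask, num_negatives=10, temperature=0.1):
    """InfoNCE-style loss over masked timesteps (minimal sketch).

    context: (B, T, D) outputs of the contextual (transformer) network
    latents: (B, T, D) target latents from the feature encoder
    mask:    (B, T) boolean, True where the input was masked
    """
    c = context[mask]        # (N, D) predictions at masked steps
    z_pos = latents[mask]    # (N, D) true latents (positives)
    N = c.shape[0]

    # Sample negatives from other masked timesteps. Simplification:
    # wav2vec 2.0 samples negatives from within the same utterance.
    neg_idx = torch.randint(0, N, (N, num_negatives))
    z_neg = z_pos[neg_idx]   # (N, K, D)

    # Cosine similarity between each prediction and its candidates;
    # the true latent is always placed at index 0.
    cand = torch.cat([z_pos.unsqueeze(1), z_neg], dim=1)          # (N, 1+K, D)
    logits = F.cosine_similarity(c.unsqueeze(1), cand, dim=-1) / temperature
    targets = torch.zeros(N, dtype=torch.long)
    return F.cross_entropy(logits, targets)
```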
Though with subtle differences, wav2vec-like models usually consist of two components: the feature encoder, based on consecutive 1D convolutional layers, and the transformer-based contextual network [1]. The convolutional feature encoder is commonly understood to embed acoustic features, whereas the contextual network handles global, abstract information [2]. Based on this understanding, state-of-the-art models choose between passing the raw waveform through the 1D convolutional feature encoder or directly inputting the spectrogram [4, 5]. However, the inner workings are still poorly understood, calling for in-depth analyses of what the convolutional network actually encodes.
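To probe the two components separately, one can extract each stage's output from a pretrained model. The sketch below assumes the torchaudio wav2vec 2.0 pipeline; the attribute names (feature_extractor, extract_features) follow torchaudio's Wav2Vec2Model and may differ in other implementations.

```python
import torch
import torchaudio

# Minimal sketch, assuming the torchaudio wav2vec 2.0 pipeline.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, int(bundle.sample_rate))  # 1 s of dummy audio

with torch.inference_mode():
    # Convolutional feature encoder only: (batch, frames, 512) latents,
    # roughly one frame per 20 ms of input audio.
    latents, _ = model.feature_extractor(waveform, length=None)
    # Full model: transformer context representations, one tensor per layer.
    contexts, _ = model.extract_features(waveform)

print(latents.shape)       # e.g., torch.Size([1, 49, 512])
print(contexts[-1].shape)  # last transformer layer output
```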
In this paper, we are motivated to explore whether the convolutional feature encoder can substitute for spectrograms, the basic tool in audio spectral analysis, which show frequency-wise amplitude information with sufficient temporal detail. Hence, we start with the simplest form: the weighted sum of different sine waves. Taking this reductionist approach, we synthesize inputs block by block instead of using recorded speech signals. Our purpose is to provide a fundamental understanding of how and why the encoder works so well.
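Such probe signals take only a few lines to generate. The sketch below builds a weighted sum of sine waves at a 16 kHz sampling rate, matching wav2vec's expected input; the particular frequencies, weights, and duration are illustrative choices, not the exact experimental settings.

```python
import numpy as np

def synth(freqs, weights, duration=1.0, sr=16000):
    """Weighted sum of sine waves: a reductionist probe signal (sketch).

    freqs:   component frequencies in Hz (e.g., F0 and formant-like tones)
    weights: per-component amplitudes
    """
    t = np.arange(int(duration * sr)) / sr
    signal = sum(w * np.sin(2 * np.pi * f * t) for f, w in zip(freqs, weights))
    # Peak-normalize so amplitude is controlled explicitly by the caller.
    return (signal / np.max(np.abs(signal))).astype(np.float32)

# Example: a 120 Hz fundamental with two formant-like components.
x = synth(freqs=[120, 500, 1500], weights=[1.0, 0.6, 0.3])
```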
The rest of the paper is organized as follows. §2 covers related work that analyzes the inner workings of wav2vec-like models by leveraging existing datasets. Then, we describe our novel analysis method in §3.1. §3.2 explores the temporal detail encoded in the representations, focusing on whether the representations are consistent across different time steps. §3.3 further investigates the representation space with respect to fundamental frequencies (F0s) and biases, comparing it with perceptual scales such as the Mel or the Bark scale. §3.4 dives into the intricacies of formants, aided by linguistic analyses, and compares different wav2vec variants. Finally, §3.5 covers how amplitude differences impact the latent representations. We conclude the paper with in-depth discussions comparing spectrogram and feature encoder representations in §4.
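For reference, the comparison in §3.3 relies on standard perceptual-scale conversions. The sketch below uses the common O'Shaughnessy Mel formula and the Traunmüller Bark approximation; each is one of several published variants.

```python
import numpy as np

def hz_to_mel(f):
    """O'Shaughnessy Mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Traunmüller (1990) Bark approximation."""
    return 26.81 * f / (1960.0 + f) - 0.53

f0s = np.array([100.0, 200.0, 400.0, 800.0])
print(hz_to_mel(f0s))   # perceptual spacing grows sublinearly with Hz
print(hz_to_bark(f0s))
```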
2. RELATED WORK
Over the past decades, using learnable convolutional layers to handle raw audio signals has been thoroughly explored, where the prevalent idea was to mimic or enhance the spectrogram. For instance, [6] observed the frequency responses of a few convolutional filters and concluded that convolutional neural networks (CNNs) behave as filter matchers. A follow-up paper showed that the early layers model the spectral envelope [7]. In [8], the authors observed the activations directly, looking for correlations with acoustic features such