
OPENING THE BLACK BOX OF WAV2VEC FEATURE ENCODER
Kwanghee Choi1∗, Eun Jung Yeo2∗
Department of Computer Science and Engineering, Sogang University, Republic of Korea1
Department of Linguistics, Seoul National University, Republic of Korea2
∗Equal contributors.
ABSTRACT
Self-supervised models, namely, wav2vec and its variants,
have shown promising results in various downstream tasks in
the speech domain. However, their inner workings are poorly
understood, calling for in-depth analyses of what the model learns. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed synthesized audio signals, which are summations of simple sine waves.
Through extensive experiments, we conclude that various kinds of information are embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to that of spectrograms, but with a fundamental difference: latent representations construct a metric space, so that closer representations imply acoustic similarity.
Index Terms—Self-supervised Learning, Convolutional
Feature Encoder, Acoustic Modeling, Phonetic Analysis
1. INTRODUCTION
Large-scale self-supervised models, namely, wav2vec [1] and
its variants [2, 3], have succeeded in a wide range of down-
stream tasks, such as automatic speech recognition [2] and
keyword spotting [4]. Furthermore, recent advances demon-
strate that the model can handle general tasks such as emo-
tion recognition, environment classification, and music tag-
ging, even with models pretrained with unlabeled speech data
[4, 5]. Concretely, self-supervised learning in wav2vec is de-
signed as partially hiding the sequential representations and
identifying the original representation based on its surround-
ings in a contrastive manner [1, 2]. Hence, one cannot help
but wonder about the acoustic capabilities of self-supervised
learning, as it does not explicitly guide the model to capture
the necessary information for speech analysis.
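To make this objective concrete, the following minimal sketch (Python/NumPy) illustrates a masked contrastive loss of this kind; the cosine similarity, temperature, and within-utterance negative sampling are illustrative assumptions rather than the exact wav2vec recipe.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def masked_contrastive_loss(context, targets, masked_steps,
                                num_negatives=10, temperature=0.1):
        # context: (T, D) outputs of the contextual network (inputs masked at masked_steps)
        # targets: (T, D) latent targets produced by the feature encoder
        T = targets.shape[0]
        losses = []
        for t in masked_steps:
            # distractors are drawn from other time steps of the same utterance
            negatives = np.random.choice([i for i in range(T) if i != t],
                                         size=num_negatives, replace=False)
            logits = np.array([cosine(context[t], targets[j])
                               for j in (t, *negatives)]) / temperature
            log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
            losses.append(-log_probs[0])  # the true latent sits at index 0
        return float(np.mean(losses))

In wav2vec 2.0, the targets are additionally quantized and the objective includes a codebook diversity term [2]; those details are omitted here for brevity.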
Despite subtle differences, wav2vec-like models usually consist of two components: the feature encoder based on consecutive 1D convolutional layers and the transformer-based contextual network [1]. The convolutional feature encoder is commonly understood to embed acoustic features, while the contextual network handles global, abstract information [2]. Based on this understanding, state-of-the-art mod-
els choose between passing the raw waveform through the 1D
convolutional feature encoder or directly inputting the spec-
trogram [4, 5]. However, the inner workings remain poorly understood, calling for in-depth analyses of what the convolutional network actually encodes.
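For concreteness, a minimal sketch of this two-component structure is given below (PyTorch); the kernel widths and strides loosely follow the wav2vec 2.0 feature encoder, while normalization, projection, masking, and quantization modules are omitted, so it should be read as an illustration rather than the exact model.

    import torch
    import torch.nn as nn

    class FeatureEncoder(nn.Module):
        """Stack of 1D convolutions mapping a raw waveform to latent frames."""
        def __init__(self, dim=512):
            super().__init__()
            # (kernel, stride) pairs; overall stride is 320 samples, i.e., 20 ms at 16 kHz
            specs = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
            layers, in_ch = [], 1
            for k, s in specs:
                layers += [nn.Conv1d(in_ch, dim, k, stride=s), nn.GELU()]
                in_ch = dim
            self.conv = nn.Sequential(*layers)

        def forward(self, wav):                 # wav: (batch, samples)
            z = self.conv(wav.unsqueeze(1))     # (batch, dim, frames)
            return z.transpose(1, 2)            # (batch, frames, dim)

    feature_encoder = FeatureEncoder()
    contextual_net = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=12)

    z = feature_encoder(torch.randn(1, 16000))  # latent representations analyzed in this paper
    c = contextual_net(z)                       # contextual representations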
In this paper, we are motivated to explore whether the convolutional feature encoder can substitute for the spectrogram, the basic tool in audio spectral analysis, which shows frequency-wise amplitude information with sufficient temporal detail. Hence, we start with the simplest form of input: the weighted sum of sine waves of different frequencies. By taking a reductionist approach, we synthesize inputs block by block instead of using recorded speech signals. Our purpose is to provide a fundamental understanding of how and why the encoder works so well.
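A minimal sketch of this input construction is given below (Python/NumPy); the specific frequencies, amplitudes, and duration are illustrative placeholders rather than the exact experimental settings.

    import numpy as np

    def synthesize(freqs_hz, amps, duration_s=1.0, sr=16000):
        """Weighted sum of sine waves, peak-normalized to avoid clipping."""
        t = np.arange(int(duration_s * sr)) / sr
        signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs_hz, amps))
        return signal / np.max(np.abs(signal))

    # e.g., an F0-like 120 Hz component plus two formant-like components
    wav = synthesize(freqs_hz=[120, 800, 1200], amps=[1.0, 0.5, 0.25])
    # wav can then be fed to the convolutional feature encoder to obtain latent frames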
We summarize the rest of the paper as follows. §2 covers
related works that analyze the inner workings of wav2vec-like
models by leveraging existing datasets. Then, we describe our
novel analysis method in §3.1. §3.2 explores the temporal
detail encoded in the representations, focusing on whether the representations are consistent across different time steps. §3.3
further investigates the representation space with respect to
fundamental frequencies (F0s) and biases, comparing with
perceptual scales such as the Mel or the Bark scale. §3.4 dives into the details of formants, aided by linguistic analyses, and compares different wav2vec variants. Finally, §3.5 covers
how amplitude differences impact latent representations. We
conclude our paper with in-depth discussions on comparing
spectrogram and feature encoder representations in §4.
2. RELATED WORKS
In the past decades, using learnable convolutional layers to handle raw audio signals has been thoroughly explored, where the prevalent idea was to mimic or enhance the spectrogram. For instance, [6] observed the frequency responses of a few convolutional filters and concluded that convolutional neural networks (CNNs) behave as filter matchers. The
follow-up paper showed that the early layers model the spectral
envelope [7]. In [8], the authors observed the activations di-
rectly, looking for correlations with the acoustic features such