EXPLORATION OF A SELF-SUPERVISED SPEECH MODEL: A STUDY ON EMOTIONAL
CORPORA
Yuanchao Li, Yumnah Mohamied, Peter Bell, Catherine Lai
Centre for Speech Technology Research, University of Edinburgh
{yuanchao.li, ymohamie, peter.bell, c.lai}@ed.ac.uk
ABSTRACT
Self-supervised speech models have developed rapidly during the
past few years and have proven feasible for use in various
downstream tasks. Some recent work has started to look at
the characteristics of these models, yet many concerns have
not been fully addressed. In this work, we conduct a study on
emotional corpora to explore a popular self-supervised model
– wav2vec 2.0. Via a set of quantitative analyses, we mainly
demonstrate that: 1) wav2vec 2.0 appears to discard paralin-
guistic information that is less useful for word recognition
purposes; 2) for emotion recognition, representations from
the middle layer alone perform as well as those derived from
layer averaging, while the final layer results in the worst per-
formance in some cases; 3) current self-supervised models
may not be the optimal solution for downstream tasks that
make use of non-lexical features. Our work provides novel
findings that will aid future research in this area and a theoretical basis for the use of existing models.
Index Terms— wav2vec 2.0, self-supervised learning,
speech emotion, speech recognition, paralinguistics
1. INTRODUCTION
Choosing the right features is a priority in machine learning-
based speech tasks. How much target information the fea-
tures contain fundamentally determines how well a model
will work. A large number of features have been proposed in extensive multi-disciplinary studies to represent and explain the complexity and variability of speech signals [1, 2, 3]. Task-specific features have been widely and effectively used in various speech tasks. For example, cepstral features such as Mel-
Frequency Cepstral Coefficients (MFCC), Linear-Frequency
Cepstral Coefficients (LFCC), and Perceptual Linear Predic-
tion (PLP) cepstral coefficients dominated Automatic Speech
Recognition (ASR) for many years [4, 5]. Similarly, other
speech tasks have their own preferred feature sets. In Speech
Emotion Recognition (SER), suprasegmental features, such as pitch, energy, and speaking rate [6, 7], have proven more help-
ful than information about phonetic segments. Aspects of
speech which are often discarded in automatic transcription,
such as disfluencies, are also known to be helpful in tasks such
as SER [8]. The same is true for other tasks, such as dialog act detection [9], which has led to handcrafted feature engineering to understand the contributions of various features.
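To make the contrast between ASR-oriented cepstral features and SER-oriented suprasegmental features concrete, the following is a minimal sketch of extracting both kinds of handcrafted features; the use of librosa and all parameter choices here are our own illustrative assumptions, not part of the original study.

```python
# Illustrative sketch only: the paper does not use this code. We assume
# librosa is available; all parameter choices here are our own.
import librosa
import numpy as np

def extract_handcrafted_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)

    # Cepstral features of the kind long dominant in ASR.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # shape: (13, frames)

    # Suprasegmental cues of the kind favoured in SER.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    energy = librosa.feature.rms(y=y)[0]                          # frame-level RMS energy

    return {
        "mfcc": mfcc,
        "f0_mean": float(np.nanmean(f0)),        # mean pitch over voiced frames
        "energy_mean": float(energy.mean()),
    }
```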
On the other hand, directly learning feature mappings
from speech signals without handcrafted feature engineering has
emerged as a trend during the past decade. Such End-to-End
(E2E) approaches benefit from the success of deep learning
technologies and have proven useful in many speech tasks,
including ASR, SER, speaker verification, and disorder clas-
sification [10, 11, 12, 13]. The E2E approach eliminates the
separate step of feature extraction and enables joint train-
ing of multiple tasks due to shared representations. This
can allow the models to learn feature spaces that are more
representative of the actual task than handcrafted features.
Inspired by the success of Self-Supervised Learning
(SSL) in natural language processing [14, 15], work on
addressing the general lack of task-specific labeled speech
data has accelerated in the past few years. Most of these ap-
proaches can be divided into generative modeling approaches
and discriminative modeling approaches [16, 17, 18]. SSL
utilizes information extracted from the input data itself as
the label to learn to encode general-purpose representations.
These pre-trained upstream models have proven effective
for downstream speech tasks, including speaker verification
[19, 20] and SER [21]. However, what these models are actu-
ally learning is still understudied and questions and concerns
remain about why and how these models benefit downstream
tasks: Are the generated representations optimal for every
task? How should they be utilized for different purposes?
With these questions in mind, we study wav2vec 2.0 [17]
on emotional corpora, demonstrating how this type of self-
supervised model can be explored for downstream tasks. Our
experiments show that: 1) wav2vec 2.0 appears to discard
some paralinguistic information that is less useful for word
recognition purposes and does not treat all emotions and par-
alinguistic features equally; 2) for SER, representations from
the final layer could result in the worst performance in some
cases; 3) current self-supervised models need to be carefully
fine-tuned to adapt to downstream tasks that make use of non-
lexical features. We hope our findings can provide the re-
search community with a new perspective to look at the ef-
fectiveness and usage of self-supervised models.
2. RELATED WORK
There is no doubt that large-scale speech models using SSL
are becoming integral to speech processing tasks. Most of
them can be divided into generative or discriminative ap-
proaches. The generative approaches generate future frames
from past frames, or masked frames from unmasked frames
by learning to minimize reconstruction loss [16, 22, 23]. On
the other hand, the discriminative approaches discriminate
positive samples from negative samples while minimizing
contrastive prediction loss [24, 17, 18]. These self-supervised
models are generally trained on Librispeech [25], a corpus
based on public domain audio books primarily used for ASR
research. Although self-supervised objectives are general, the
design of popular SSL models has been primarily driven by
the goal of improving automatic transcription.
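For illustration, the sketch below shows a generic InfoNCE-style contrastive loss of the kind these discriminative approaches minimize; it is a simplified, hypothetical sketch and omits W2V2 specifics such as quantized targets and the diversity loss.

```python
# Schematic InfoNCE-style contrastive loss: pick the true target for a masked
# frame out of a set of distractors. A simplified, hypothetical sketch that
# omits W2V2 specifics such as quantized targets and the diversity loss.
import torch
import torch.nn.functional as F

def contrastive_loss(context, target, distractors, temperature=0.1):
    """
    context:     (B, D)    Transformer output at a masked position
    target:      (B, D)    true representation of that position
    distractors: (B, K, D) negatives drawn from other masked positions
    """
    candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)      # (B, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)   # (B, K+1)
    logits = sims / temperature
    labels = torch.zeros(context.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)   # the true target sits at index 0
```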
Unlike traditional speech modeling approaches that have
been extensively researched, these SSL models have just
started to be explored in very recent years, with wav2vec
2.0 (W2V2) attracting the most attention for its wide appli-
cation potential. For example, Pasad et al. [26] conducted
layer-wise analysis of W2V2 using a suite of tools and found that: 1) acoustic and linguistic properties are encoded in different layers; 2) the pre-trained model exhibits autoencoder-style behavior; 3) the model encodes some non-trivial word mean-
ing information. Fan et al. [20] showed that W2V2 has the
ability to discriminate between speakers as well as languages, and that this distinction is more pronounced in lower layers. They
hence proposed multi-task learning of speaker verification
and language identification, and verified its feasibility. Li et
al. [27] noticed that the recognition of longer emotional utterances, which contain more contextual information, benefits from the contextual modeling of W2V2. They proposed a joint
training scheme by hierarchically fusing multiple W2V2 out-
puts for SER. Yang et al. [28] set up benchmark performance
using self-supervised speech models on a range of tasks.
Nevertheless, these self-supervised speech models are
still understudied and the above-mentioned works have lim-
itations. For example, [26] did not extend their exploration
to non-ASR downstream tasks. In [20] and [27], only a portion of the layer differences was shown, leaving a thorough layer-wise analysis missing. In [28], downstream task performance was presented without further explanation. Furthermore, none
of those studies investigated paralinguistic characteristics in
W2V2 representations. As such, in the current work, we
build on previous work while adding new perspectives from
detailed quantitative analysis on emotional corpora.
3. CORPORA AND MODEL DESCRIPTION
3.1. Corpora
IEMOCAP (IEM) [29] has five dyadic sessions with ten ac-
tors (five male and five female), each session containing both scripted and improvised multimodal interactions. The corpus consists of ap-
proximately 12 hours of speech that has been annotated by
three annotators with ten emotion classes. Following prior re-
search [27], we combined Happy and Excited, and removed
utterances that do not have transcripts, bringing the total num-
ber of utterances used in this study to 5,500, each with one
label from four classes: Angry, Happy, Neutral, and Sad.
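As a concrete illustration of this label preparation, a minimal sketch follows; the short emotion codes and the utterance fields are hypothetical stand-ins rather than the actual IEMOCAP annotation format.

```python
# Hypothetical sketch of the label preparation described above; the emotion
# codes and utterance fields are illustrative stand-ins.
LABEL_MAP = {
    "ang": "Angry",
    "hap": "Happy",
    "exc": "Happy",   # Excited is merged into Happy
    "neu": "Neutral",
    "sad": "Sad",
}

def keep_utterance(utt):
    """Keep utterances that have a transcript and one of the four target labels."""
    return bool(utt.get("transcript")) and utt.get("emotion") in LABEL_MAP

def map_label(utt):
    return LABEL_MAP[utt["emotion"]]
```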
RAVDESS (RAV) [30] contains a speech set and a song set.
We only use the speech set, which has 1,440 utterances from
24 actors (12 female, 12 male) in eight emotions: Calm,
Happy, Sad, Angry, Fear, Surprise, Disgust, and Neutral.
Ratings were provided by untrained individuals. During data collection, each actor spoke two fixed sentences in each emotion class, so the corpus has a good balance of emotions. Each actor was given two trials per utterance, producing 60 speech clips in total.
The main reason we choose RAV is that, even though other corpora may be larger, it provides fixed sentences spoken with different emotional expressions. This setting “forces” different emotions to share the same linguistic content, removing lexical effects (e.g., word pronunciation causing prosody variation) and thus helping us better isolate the acoustic properties captured by W2V2.
3.2. Model
We look at W2V2 [17], an SSL framework comprising three
major components: a CNN-based local encoder that extracts a
sequence of embeddings from raw audio as latent speech rep-
resentation Z, a Transformer network for obtaining context
representation C, and a quantization module for discretiz-
ing Z into Q. Following previous work [26], we focus our
attention on the latent representations learned by the Trans-
former module of W2V2. In this work, we use wav2vec2-
base, wav2vec2-base-100h, and wav2vec2-base-960h mod-
els, which are the pre-trained and fine-tuned models (on 100h
and 960h of Librispeech) respectively. We refer to them as
PT, FT100, and FT960. We choose W2V2 because it is the
most widely used SSL speech model, with the expectation
that the exploratory approach can be generalized to similar
SSL models.
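As a minimal sketch of how such layer-wise representations can be obtained, the code below loads the three checkpoints from HuggingFace Transformers and returns all Transformer hidden states; the choice of toolkit and the helper function are our assumptions, since the paper does not specify its implementation.

```python
# Minimal sketch of layer-wise feature extraction. The use of HuggingFace
# Transformers and the helper below are our assumptions; the paper does not
# specify its implementation toolkit.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINTS = {
    "PT": "facebook/wav2vec2-base",
    "FT100": "facebook/wav2vec2-base-100h",
    "FT960": "facebook/wav2vec2-base-960h",
}

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINTS["PT"])
model = Wav2Vec2Model.from_pretrained(CHECKPOINTS["PT"])
model.eval()

def layer_representations(waveform, sr=16000):
    """Return the hidden states of every Transformer layer for one utterance."""
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values, output_hidden_states=True)
    # Tuple of 13 tensors for the base model: the pre-Transformer projection
    # plus the outputs of the 12 Transformer layers, each of shape (1, T, 768).
    return outputs.hidden_states
```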
4. EXPERIMENTS AND RESULTS
We perform a set of probing experiments, including the fol-
lowing quantitative measures:
Probing SER performance. We first implement a layer-wise
analysis by using the output of every individual layer within
the Transformer network to demonstrate how information en-
coded by W2V2 contributes to SER. Next, as there is no common practice for how to utilize W2V2 representations as input features for downstream tasks, we compare the performance of three commonly used approaches: 1) taking the last layer output
[31, 32, 33]; 2) taking the average of all layer outputs [34];
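A minimal sketch, under the same assumptions as the extraction code in Section 3.2, of how the first two of these strategies could be implemented (function names are hypothetical):

```python
# Sketch of the first two pooling strategies, reusing the hypothetical
# layer_representations() helper from Section 3.2.
import torch

def last_layer_feature(hidden_states):
    # Strategy 1: output of the final Transformer layer, mean-pooled over time.
    return hidden_states[-1].mean(dim=1)                 # (1, 768)

def layer_average_feature(hidden_states):
    # Strategy 2: average over all Transformer layers, then mean-pool over time.
    stacked = torch.stack(hidden_states[1:], dim=0)      # drop the pre-Transformer projection
    return stacked.mean(dim=0).mean(dim=1)               # (1, 768)
```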