2. RELATED WORK
Large-scale speech models trained with SSL have become integral to speech processing tasks. Most can be divided into generative and discriminative approaches. Generative approaches predict future frames from past frames, or masked frames from unmasked frames, by minimizing a reconstruction loss [16, 22, 23]. Discriminative approaches, on the other hand, distinguish positive samples from negative samples by minimizing a contrastive prediction loss [24, 17, 18]. These self-supervised
models are generally trained on Librispeech [25], a corpus
based on public domain audio books primarily used for ASR
research. Although self-supervised objectives are general, the
design of popular SSL models has been primarily driven by
the goal of improving automatic transcription.
Unlike traditional speech modeling approaches, which have been researched extensively, these SSL models have only recently begun to be explored, with wav2vec 2.0 (W2V2) attracting the most attention for its wide application potential. For example, Pasad et al. [26] conducted a layer-wise analysis of W2V2 using a suite of tools and found that
1) acoustic and linguistic properties are encoded in different
layers; 2) the pre-trained model follows an autoencoder-style
behavior; 3) the model encodes some non-trivial word mean-
ing information. Fan et al. [20] showed that W2V2 can discriminate between speakers as well as languages, and that this distinction is more pronounced in lower layers. They therefore proposed multi-task learning of speaker verification and language identification and verified its feasibility. Li et al. [27] observed that the recognition of longer emotional utterances, which contain more contextual information, benefits from the contextual modeling of W2V2, and proposed a joint training scheme that hierarchically fuses multiple W2V2 outputs for SER. Yang et al. [28] established benchmark performance of self-supervised speech models on a range of tasks.
Nevertheless, these self-supervised speech models remain understudied, and the above-mentioned works have limitations. For example, [26] did not extend their exploration to non-ASR downstream tasks. In [20] and [27], only part of the layer-wise differences was shown, so a thorough layer-wise analysis is missing. In [28], downstream task performance was presented without further explanation. Furthermore, none
of those studies investigated paralinguistic characteristics in
W2V2 representations. As such, in the current work, we
build on previous work while adding new perspectives from
detailed quantitative analysis on emotional corpora.
3. CORPORA AND MODEL DESCRIPTION
3.1. Corpora
IEMOCAP (IEM) [29] consists of five dyadic sessions with ten actors (five male and five female), each containing scripted and improvised multimodal interactions. The corpus contains approximately 12 hours of speech annotated by three annotators with ten emotion classes. Following prior research [27], we combined Happy and Excited and removed utterances without transcripts, bringing the total number of utterances used in this study to 5,500, each labeled with one of four classes: Angry, Happy, Neutral, and Sad.
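As a concrete illustration, a minimal sketch of this label preparation is given below; the categorical label abbreviations and the (utt_id, label, transcript) structure are assumptions made for illustration, and parsing of the corpus files is omitted.

# Minimal sketch of the IEMOCAP label preparation described above.
# The label abbreviations and the (utt_id, label, transcript) tuples are
# illustrative assumptions; reading the corpus annotation files is omitted.

LABEL_MAP = {
    "ang": "Angry",
    "hap": "Happy",
    "exc": "Happy",   # Excited is merged with Happy
    "neu": "Neutral",
    "sad": "Sad",
}

def filter_iemocap(utterances):
    """Keep utterances that have a transcript and map to one of the four classes."""
    kept = []
    for utt_id, label, transcript in utterances:
        if not transcript:            # drop utterances without transcripts
            continue
        mapped = LABEL_MAP.get(label)
        if mapped is None:            # drop all other emotion classes
            continue
        kept.append((utt_id, mapped, transcript))
    return kept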
RAVDESS (RAV) [30] contains a speech set and a song set.
We only use the speech set, which has 1,440 utterances from
24 actors (12 female, 12 male) in eight emotions: Calm,
Happy, Sad, Angry, Fear, Surprise, Disgust, and Neutral.
Ratings were provided by untrained individuals. During data collection, the actors spoke two fixed sentences in each emotion class, so the corpus is well balanced across emotions. Each actor was given two trials per utterance and produced 60 speech clips in total. The main reason we choose RAV is that, even though other corpora may be larger, it provides fixed sentences spoken with different emotional expressions. Such a setting excludes lexical influence by “forcing” different emotions to share the same linguistic content, helping us better explore the acoustic properties of W2V2 representations by eliminating effects introduced by lexical content (e.g., word pronunciation causing prosody variation).
3.2. Model
We look at W2V2 [17], an SSL framework comprising three major components: a CNN-based local encoder that extracts a sequence of embeddings from raw audio as latent speech representation Z, a Transformer network for obtaining context representation C, and a quantization module for discretizing Z into Q. Following previous work [26], we focus our attention on the latent representations learned by the Transformer module of W2V2. In this work, we use the wav2vec2-base, wav2vec2-base-100h, and wav2vec2-base-960h models, which are the pre-trained model and the models fine-tuned on 100 and 960 hours of Librispeech, respectively. We refer to them as PT, FT100, and FT960. We choose W2V2 because it is the most widely used SSL speech model, with the expectation that the exploratory approach can be generalized to similar SSL models.
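For reference, layer-wise Transformer outputs can be obtained, for example, with the HuggingFace transformers implementation of W2V2. This is a minimal sketch, assuming the hub checkpoint names below correspond to PT, FT100, and FT960; the mean pooling over time is an illustrative choice, not a claim about our probing setup.

# Sketch of extracting layer-wise Transformer outputs from W2V2 using the
# HuggingFace transformers library; checkpoint names mirror PT / FT100 / FT960.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINTS = {
    "PT": "facebook/wav2vec2-base",
    "FT100": "facebook/wav2vec2-base-100h",
    "FT960": "facebook/wav2vec2-base-960h",
}

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINTS["PT"])
model = Wav2Vec2Model.from_pretrained(CHECKPOINTS["PT"])
model.eval()

def layerwise_embeddings(waveform, sample_rate=16000):
    """Return one utterance-level vector per hidden state (mean-pooled over time)."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple: the projected CNN features followed by
    # the output of each of the 12 Transformer layers of the base model.
    return [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]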
4. EXPERIMENTS AND RESULTS
We perform a set of probing experiments, including the fol-
lowing quantitative measures:
Probing SER performance. We first implement a layer-wise
analysis by using the output of every individual layer within
the Transformer network to demonstrate how information en-
coded by W2V2 contributes to SER. Next, as there is no common practice for utilizing W2V2 representations as input features for downstream tasks, we compare the performance of three commonly used approaches: 1) taking the last layer output [31, 32, 33]; 2) taking the average of all layer outputs [34];