EXTRACTING SPEAKER AND EMOTION INFORMATION FROM SELF-SUPERVISED
SPEECH MODELS VIA CHANNEL-WISE CORRELATIONS
Themos Stafylakis1†, Ladislav Mošner2†, Sofoklis Kakouros3†, Oldřich Plchot2, Lukáš Burget2, Jan Černocký2

1Omilia - Conversational Intelligence, Athens, Greece
2Brno University of Technology, Faculty of Information Technology, Speech@FIT, Czechia
3University of Helsinki, Finland
†Equal contribution
ABSTRACT
Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, the first- and second-order statistics of representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations — correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
Index Terms— Speaker identification, speaker verification, emotion recognition, self-supervised models
1. INTRODUCTION
Large speech models trained in a self-supervised manner, such as Wav2Vec 2.0, HuBERT, and WavLM, have shown exceptional performance when finetuned on the downstream tasks [1, 2, 3]. However, finetuning the weights of such models on each task is a non-scalable solution for production systems performing several of these downstream tasks in real-time. For such systems, the preferable solution would be to extract a single set of speech features from a shared model, followed by a task-specific lightweight classifier.
To this end, the Speech processing Universal PERformance Benchmark (SUPERB) challenge [4] was recently introduced, with the goal of benchmarking the performance of such speech models on a variety of speech tasks (e.g. ASR, keyword spotting, spoken language understanding, emotion recognition, speaker recognition, identification and diarization, among others), while keeping the models' weights frozen and employing a lightweight task-specific classifier.
Among the interesting findings of SUPERB is the ability of such models to encode speaker and emotion information in the intermediate layers. The models are trained using an implicitly phonetic loss (typically a masked-language-model style loss over quantized vector representations), which means that speaker and emotion modeling is not directly encouraged. However, the models must be performing some kind of speaker and emotion modeling in the intermediate layers, in order to suppress this nuisance variability in the output layer.
For those tasks requiring utterance-level classification or representation learning, the SUPERB benchmark employs a lightweight trainable classifier incorporating pooling. The pooling methods used are (a) mean pooling, and (b) statistics pooling (concatenated mean and standard deviation, std, vectors) in speaker verification (SV), which is the standard pooling method for x-vectors [5].
Although mean pooling typically yields good performance in several speaker and emotion modeling tasks [6], we should emphasize a crucial difference between these models and the SUPERB setup [4]: the models are frozen, after being pretrained with a loss function that does not directly encourage speaker or emotion modeling. As a result, methods using mean pooling make simplifying assumptions about the way the information is encoded into the network's internal representations.
Mean pooling, including the statistics pooling variant, assumes that the correlations between different feature dimensions (or channels) are of little or no importance. This might be true if the model is trained or finetuned this way, i.e. with a classifier that extracts fixed-length representations via mean pooling (e.g. using a softmax classifier with cross-entropy loss). After all, deep neural networks have the capacity to extract information relevant to the task in a useful form. However, in the case of pretrained models, such an assumption is at least questionable, and methods that take into account correlations between feature channels should be considered.
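For illustration, here is a small numpy sketch (ours, with synthetic data) of the information such pooling discards: the two sequences below have identical per-channel means and standard deviations, yet opposite channel-wise correlations.

import numpy as np

rng = np.random.default_rng(0)
T = 1000
x = rng.standard_normal(T)
x -= x.mean()  # zero-mean, so both sequences share identical channel means

seq_a = np.stack([x,  x], axis=1)   # (T, 2): channels perfectly correlated
seq_b = np.stack([x, -x], axis=1)   # (T, 2): channels perfectly anti-correlated

for name, seq in [("A", seq_a), ("B", seq_b)]:
    print(name, "mean:", seq.mean(0).round(3), "std:", seq.std(0).round(3))
    print(name, "corr(ch0, ch1):", np.corrcoef(seq.T)[0, 1].round(3))

# A and B yield identical mean and std vectors, but correlations of +1 vs. -1:
# no pooled statistic built only from means and stds can tell them apart.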
In this paper, we focus on three SUPERB tasks that require sentence-level representations, namely speaker verification and identification, and emotion recognition. We show that a significant portion of the information related to speaker and emotion is encoded in the channel-wise correlations of the intermediate layers. The idea is based on the correlation
pooling method originally introduced in [7]. However, the network in [7], apart from having a very different architecture (a 2D ResNet), was not pretrained and/or frozen, but trained from scratch for the speaker recognition task.
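As a rough illustration, the following PyTorch sketch pools a (T, D) sequence of frame-level features into the off-diagonal upper triangle of its D x D channel correlation matrix. This is our simplified reading of correlation pooling; it is not guaranteed to match the exact formulation of [7] or of the released code.

import torch

def correlation_pooling(h: torch.Tensor) -> torch.Tensor:
    """h: (T, D) frame-level features -> (D*(D-1)/2,) correlation vector."""
    h = h - h.mean(dim=0, keepdim=True)                          # center each channel
    h = h / (h.pow(2).mean(dim=0, keepdim=True).sqrt() + 1e-8)   # unit variance per channel
    corr = (h.T @ h) / h.shape[0]                                # (D, D) correlation matrix
    iu = torch.triu_indices(corr.shape[0], corr.shape[1], offset=1)
    return corr[iu[0], iu[1]]                                    # keep upper triangle only

T, D = 300, 768                      # D = 768 matches base-sized models, e.g. WavLM
pooled = correlation_pooling(torch.randn(T, D))
print(pooled.shape)                  # torch.Size([294528])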
2. RELATED WORK
Correlations between channels have been explored in computer vision as a means to extract and/or modify the style and the texture of images. The work of L.A. Gatys et al. [8] introduced neural style transfer, showing that such image characteristics are captured by the channel-wise correlations of deep convolutional networks (ConvNets) trained for object recognition on ImageNet. Their method was adapted to speech generation and voice conversion systems in [9], where the authors demonstrated that intermediate layer representations encode speaker characteristics.
The proposed pooling method was first introduced in [7], where it was shown that 2D deep convolutional networks (ResNet-34) can be trained from scratch while using channel-wise correlation pooling over frequency ranges, and experiments on VoxCeleb demonstrated improvements over the standard statistics (mean-std) pooling. In this work, we extend this method to pretrained self-supervised transformer models (which are 1D, since self-attention operates only across the temporal axis) and we also test it on an emotion recognition task.
3. SENTENCE-LEVEL REPRESENTATIONS IN SUPERB
In this section, we describe the proposed correlation pooling
as well as several details related to the SUPERB challenge.
3.1. Transformer-based architectures
The most powerful self-supervised models follow the transformer architecture. The input features are extracted from the waveform (at a rate of 50 fps) via a ConvNet, which is trained jointly with the transformer. The ConvNet is typically kept frozen even in cases where the model is finetuned, as it can easily overfit. Each transformer layer follows the encoder block architecture defined in [10]: it consists of a multi-head self-attention layer followed by a feed-forward layer, with layer normalization applied in both. Critically, skip connections are added between these layers, as in ResNets. An interesting property of architectures equipped with skip connections is that the correspondence between units of the representations of different layers is preserved (i.e. the representations are aligned). Each layer adds further contextualization (via self-attention in the case of transformers) and details needed for optimizing the task defined by the loss function (e.g. modeling and subsequently suppressing nuisance variabilities, such as speaker, noise, channel, and emotion). However, the i-th unit of the l-th layer's representation captures a similar characteristic to the i-th unit of all other layers.
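The following PyTorch sketch (with hypothetical sizes) illustrates this residual structure: each layer only adds an update to its input, so all L+1 representations share the same shape and unit ordering, which is the property the layerwise pooling of the next section relies on.

import torch
import torch.nn as nn

d_model, n_layers, T = 768, 4, 100
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
     for _ in range(n_layers)]
)

x = torch.randn(1, T, d_model)        # (batch, time, channels)
hidden = [x]                          # stand-in for h_{t,0}, the ConvNet output
for layer in layers:
    hidden.append(layer(hidden[-1]))  # residual paths live inside each layer

# All L+1 representations have identical shapes, a prerequisite for
# collapsing them with a weighted average.
print([tuple(h.shape) for h in hidden])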
3.2. Layerwise pooling
The alignment between representations from different layers permits an easy way of extracting information relevant to the downstream task from all (i.e. both output and intermediate) representations, by collapsing them into a single one via a weighted average. The weights are learned jointly with the task-specific classification network. More concretely, the averaged representation for a transformer with L layers is expressed as

h_t = \sum_{l=0}^{L} \gamma_l h_{t,l},    (1)

where the weights, which satisfy \sum_{l=0}^{L} \gamma_l = 1 and \gamma_l \geq 0, are implemented with a learnable vector of size L+1 followed by the softmax function, and h_{t,l} is the representation of the l-th layer at time t (h_{t,0} is the output of the ConvNet).
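A minimal PyTorch sketch of Eq. (1), with hypothetical names, could look as follows; a learnable vector of L+1 logits is mapped through a softmax, which enforces the non-negativity and sum-to-one constraints, and is trained jointly with the downstream classifier.

import torch
import torch.nn as nn

class LayerwisePooling(nn.Module):
    def __init__(self, n_layers: int):
        super().__init__()
        # one learnable logit per representation (L+1, including the ConvNet output)
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        """hidden: (L+1, T, D) stacked layer outputs -> (T, D) weighted average."""
        gamma = torch.softmax(self.logits, dim=0)   # gamma_l >= 0, sums to 1
        return torch.einsum("l,ltd->td", gamma, hidden)

pool = LayerwisePooling(n_layers=13)                # e.g. 12 layers + ConvNet output
h = pool(torch.randn(13, 200, 768))                 # -> torch.Size([200, 768])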
Note that collapsing all L+1 representations into a single one via a simple weighted average would not make sense for networks without skip connections, even if their sizes were the same, unless the models were trained with a loss defined on the layerwise-averaged representation (which is not the case here, since the loss of the self-supervised models is applied to the output layer). Exploring all representations would otherwise require either concatenation along the feature dimension (increasing the size of the latter by a factor of L+1) or an exhaustive search (i.e. training a different classifier for each of the L+1 layers) to find the single most informative representation for each task.
The SUPERB protocol suggests this weighted-average type of layer-wise pooling for evaluating different models and tasks, and we follow it in this work. Note that this type of layer-wise pooling was also used in ELMo, a bidirectional LSTM-based language model with skip connections, designed to extract deep contextualized word representations [11].
3.3. Mean pooling
Tasks requiring sentence-level classification typically employ a pooling method, such as mean, max, or attentive pooling. Mean pooling, which is employed in SUPERB, is defined as

r = \bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t,    (2)

where T is the number of acoustic features of an utterance extracted by the ConvNet, r is the resulting pooled representation, and h_t are the representations at time t after layerwise pooling. Concatenating the pooled representation with the channel-wise standard deviation vector yields the statistics (mean-std) pooling used for SV.
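A minimal PyTorch sketch of Eq. (2) and its mean-std extension might read as follows (our illustration; shapes are hypothetical).

import torch

def mean_pooling(h: torch.Tensor) -> torch.Tensor:
    """h: (T, D) layerwise-pooled frames -> (D,) utterance-level vector r."""
    return h.mean(dim=0)

def statistics_pooling(h: torch.Tensor) -> torch.Tensor:
    """Mean-std pooling: concatenate mean and standard deviation, (T, D) -> (2D,)."""
    return torch.cat([h.mean(dim=0), h.std(dim=0)], dim=0)

h = torch.randn(200, 768)
print(mean_pooling(h).shape)        # torch.Size([768])
print(statistics_pooling(h).shape)  # torch.Size([1536])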