EXTRACTING SPEAKER AND EMOTION INFORMATION FROM SELF-SUPERVISED
SPEECH MODELS VIA CHANNEL-WISE CORRELATIONS
Themos Stafylakis1†, Ladislav Mošner2†, Sofoklis Kakouros3†, Oldřich Plchot2, Lukáš Burget2, Jan Černocký2

1Omilia - Conversational Intelligence, Athens, Greece
2Brno University of Technology, Faculty of Information Technology, Speech@FIT, Czechia
3University of Helsinki, Finland
†Equal contribution
ABSTRACT
Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, the first- and second-order statistics of representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations — correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
Index Terms— Speaker identification, speaker verification, emotion recognition, self-supervised models
1. INTRODUCTION
Large speech models trained in a self-supervised manner, such as Wav2Vec 2.0, HuBERT, and WavLM, have shown exceptional performance when finetuned on the downstream tasks [1, 2, 3]. However, finetuning the weights of such models on each task is a non-scalable solution for production systems performing several of these downstream tasks in real-time. For such systems, the preferable solution would be to extract a single set of speech features from a shared model, followed by a task-specific lightweight classifier.
To this end, the Speech processing Universal PERformance Benchmark (SUPERB) challenge [4] was recently introduced, with the goal of benchmarking the performance of such speech models on a variety of speech tasks (e.g. ASR, keyword spotting, spoken language understanding, emotion recognition, speaker recognition, identification and diarization, among others), while keeping the models' weights frozen and employing a lightweight task-specific classifier.
Among the interesting findings of SUPERB is the ability of such models to encode speaker and emotion information in the intermediate layers. The models are trained using an implicitly phonetic loss (typically a masked-language-model style loss over quantized vector representations), which means that speaker and emotion modeling is not directly encouraged. However, the models must be performing some kind of speaker and emotion modeling in the intermediate layers, in order to suppress this nuisance variability in the output layer.
For those tasks requiring utterance-level classification or representation learning, the SUPERB benchmark employs a lightweight trainable classifier incorporating pooling. The pooling methods used are (a) mean pooling, and (b) statistics pooling (concatenated mean and standard deviation, std, vectors) in speaker verification (SV), which is the standard pooling method for x-vectors [5].
Although mean pooling typically yields good performance in several speaker and emotion modeling tasks [6], we should emphasize a crucial difference between these models and the SUPERB setup [4]: the models are frozen, after being pretrained with a loss function that does not directly encourage speaker or emotion modeling. As a result, methods using mean pooling make simplifying assumptions about the way the information is encoded into the network's internal representations.
Mean pooling, including the statistics pooling variant, assumes that the correlations between different feature dimensions (or channels) are of little or no importance. This might be true if the model is trained or finetuned this way, i.e. with a classifier that extracts fixed-length representations via mean pooling (e.g. using a softmax classifier with cross-entropy loss). After all, deep neural networks have the capacity to extract information relevant to the task in a useful form. However, in the case of pretrained models, such an assumption is at least questionable, and methods that take into account correlations between feature channels should be considered.
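For illustration, here is a small numpy sketch (ours, with synthetic data) of the information such pooling discards: the two sequences below have identical per-channel means and standard deviations, yet opposite channel-wise correlations.

import numpy as np

rng = np.random.default_rng(0)
T = 1000
x = rng.standard_normal(T)
x -= x.mean()  # zero-mean, so both sequences share identical channel means

seq_a = np.stack([x,  x], axis=1)   # (T, 2): channels perfectly correlated
seq_b = np.stack([x, -x], axis=1)   # (T, 2): channels perfectly anti-correlated

for name, seq in [("A", seq_a), ("B", seq_b)]:
    print(name, "mean:", seq.mean(0).round(3), "std:", seq.std(0).round(3))
    print(name, "corr(ch0, ch1):", np.corrcoef(seq.T)[0, 1].round(3))

# A and B yield identical mean and std vectors, but correlations of +1 vs. -1:
# no pooled statistic built only from means and stds can tell them apart.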
In this paper, we focus on three SUPERB tasks that require sentence-level representations, namely speaker verification and identification, and emotion recognition. We show that a significant portion of the information related to speaker and emotion is encoded in the channel-wise correlations of the intermediate layers. The idea is based on the correlation
pooling method originally introduced in [7]. However, the network in [7], apart from having a very different architecture (a 2D ResNet), was not pretrained and/or frozen, but trained from scratch for the speaker recognition task.
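As a rough illustration, the following PyTorch sketch pools a (T, D) sequence of frame-level features into the off-diagonal upper triangle of its D x D channel correlation matrix. This is our simplified reading of correlation pooling; it is not guaranteed to match the exact formulation of [7] or of the released code.

import torch

def correlation_pooling(h: torch.Tensor) -> torch.Tensor:
    """h: (T, D) frame-level features -> (D*(D-1)/2,) correlation vector."""
    h = h - h.mean(dim=0, keepdim=True)                          # center each channel
    h = h / (h.pow(2).mean(dim=0, keepdim=True).sqrt() + 1e-8)   # unit variance per channel
    corr = (h.T @ h) / h.shape[0]                                # (D, D) correlation matrix
    iu = torch.triu_indices(corr.shape[0], corr.shape[1], offset=1)
    return corr[iu[0], iu[1]]                                    # keep upper triangle only

T, D = 300, 768                      # D = 768 matches base-sized models, e.g. WavLM
pooled = correlation_pooling(torch.randn(T, D))
print(pooled.shape)                  # torch.Size([294528])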
2. RELATED WORK
Correlations between channels have been explored in computer vision as a means to extract and/or modify the style and the texture of images. The work of L.A. Gatys et al. [8] introduced neural style transfer, showing that such image characteristics are captured by the channel-wise correlations of deep convolutional networks (ConvNets) trained for object recognition on ImageNet. Their method was adapted to speech generation and voice conversion systems in [9], where the authors demonstrated that intermediate layer representations encode speaker characteristics.
The proposed pooling method was first introduced in [7], where it was shown that 2D deep convolutional networks (ResNet-34) can be trained from scratch while using channel-wise correlation pooling over frequency ranges, and experiments on VoxCeleb demonstrated improvements over the standard statistics (mean-std) pooling. In this work, we extend this method to pretrained self-supervised transformer models (which are 1D, since self-attention operates only across the temporal axis) and we also test it on an emotion recognition task.
3. SENTENCE-LEVEL REPRESENTATIONS IN SUPERB
In this section, we describe the proposed correlation pooling
as well as several details related to the SUPERB challenge.
3.1. Transformer-based architectures
The most powerful self-supervised models follow the transformer architecture. The input features are extracted from the waveform (at a rate of 50 fps) via a ConvNet, which is trained jointly with the transformer. The ConvNet is typically kept frozen even in cases where the model is finetuned, as it can easily overfit. Each transformer layer follows the encoder block architecture defined in [10]: it consists of a multi-head self-attention layer followed by a feed-forward layer, with layer normalization applied in both. Critically, skip connections are added between these layers, as in ResNets. An interesting property of architectures equipped with skip connections is that the correspondence between units of the representations of different layers is preserved (i.e. the representations are aligned). Each layer adds further contextualization (via self-attention in the case of transformers) and details needed for optimizing the task defined by the loss function (e.g. modeling and subsequently suppressing nuisance variabilities, such as speaker, noise, channel, and emotion). However, the i-th unit of the l-th layer's representation captures a similar characteristic to the i-th unit of all other layers.
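The following PyTorch sketch (with hypothetical sizes) illustrates this residual structure: each layer only adds an update to its input, so all L+1 representations share the same shape and unit ordering, which is the property the layerwise pooling of the next section relies on.

import torch
import torch.nn as nn

d_model, n_layers, T = 768, 4, 100
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
     for _ in range(n_layers)]
)

x = torch.randn(1, T, d_model)        # (batch, time, channels)
hidden = [x]                          # stand-in for h_{t,0}, the ConvNet output
for layer in layers:
    hidden.append(layer(hidden[-1]))  # residual paths live inside each layer

# All L+1 representations have identical shapes, a prerequisite for
# collapsing them with a weighted average.
print([tuple(h.shape) for h in hidden])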
3.2. Layerwise pooling
The alignment between representations from different layers permits an easy way of extracting information relevant to the downstream task from all (i.e. both output and intermediate) representations, by collapsing them into a single one via a weighted average. The weights are learned jointly with the task-specific classification network. More concretely, the averaged representation for a transformer with L layers is expressed as

h_t = \sum_{l=0}^{L} \gamma_l h_{t,l},    (1)

where the weights, which satisfy \sum_{l=0}^{L} \gamma_l = 1 and \gamma_l \geq 0, are implemented with a learnable vector of size L+1 followed by the softmax function, and h_{t,l} is the representation of the l-th layer at time t (h_{t,0} is the output of the ConvNet).
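A minimal PyTorch sketch of Eq. (1), with hypothetical names, could look as follows; a learnable vector of L+1 logits is mapped through a softmax, which enforces the non-negativity and sum-to-one constraints, and is trained jointly with the downstream classifier.

import torch
import torch.nn as nn

class LayerwisePooling(nn.Module):
    def __init__(self, n_layers: int):
        super().__init__()
        # one learnable logit per representation (L+1, including the ConvNet output)
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        """hidden: (L+1, T, D) stacked layer outputs -> (T, D) weighted average."""
        gamma = torch.softmax(self.logits, dim=0)   # gamma_l >= 0, sums to 1
        return torch.einsum("l,ltd->td", gamma, hidden)

pool = LayerwisePooling(n_layers=13)                # e.g. 12 layers + ConvNet output
h = pool(torch.randn(13, 200, 768))                 # -> torch.Size([200, 768])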
Note that collapsing all L+1 representations into a single one via a simple weighted average would not make sense for networks without skip connections, even if their sizes were the same, unless the models were trained with a loss defined on the layerwise-averaged representation (which is not the case here, since the loss of the self-supervised models is applied to the output layer). Exploring all representations would otherwise require either concatenation along the feature dimension (increasing the size of the latter by a factor of L+1) or an exhaustive search (i.e. training a different classifier for each of the L+1 layers) to find the single most informative representation for each task.
The SUPERB protocol suggests this weighted-average type of layer-wise pooling for evaluating different models and tasks, and we follow it in this work. Note that this type of layer-wise pooling was also used in ELMo, a bidirectional LSTM-based language model with skip connections, designed to extract deep contextualized word representations [11].
3.3. Mean pooling
Tasks requiring sentence-level classification typically employ a pooling method, such as mean, max, or attentive pooling. Mean pooling, which is employed in SUPERB, is defined as

r = \bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t,    (2)

where T is the number of acoustic features of an utterance extracted by the ConvNet, r is the resulting pooled representation, and h_t are the representations at time t after layerwise pooling. Concatenating the pooled representation with the channel-wise standard deviation vector yields the statistics (mean-std) pooling used for SV.
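A minimal PyTorch sketch of Eq. (2) and its mean-std extension might read as follows (our illustration; shapes are hypothetical).

import torch

def mean_pooling(h: torch.Tensor) -> torch.Tensor:
    """h: (T, D) layerwise-pooled frames -> (D,) utterance-level vector r."""
    return h.mean(dim=0)

def statistics_pooling(h: torch.Tensor) -> torch.Tensor:
    """Mean-std pooling: concatenate mean and standard deviation, (T, D) -> (2D,)."""
    return torch.cat([h.mean(dim=0), h.std(dim=0)], dim=0)

h = torch.randn(200, 768)
print(mean_pooling(h).shape)        # torch.Size([768])
print(statistics_pooling(h).shape)  # torch.Size([1536])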