ON THE USE OF SEMANTICALLY-ALIGNED SPEECH REPRESENTATIONS
FOR SPOKEN LANGUAGE UNDERSTANDING
Gaëlle Laperrière1, Valentin Pelloin2, Mickaël Rouvier1, Themos Stafylakis3, Yannick Estève1
1LIA - Avignon Université, France
2LIUM - Le Mans Université, France
3Omilia - Conversational Intelligence, Greece
ABSTRACT
In this paper we examine the use of semantically-aligned speech representations for end-to-end spoken language understanding (SLU). We employ the recently-introduced SAMU-XLSR model, which is designed to generate a single embedding that captures the semantics at the utterance level and is semantically aligned across different languages. This model combines the acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model. We show that using the SAMU-XLSR model instead of the initial XLS-R model significantly improves performance in the framework of end-to-end SLU. Finally, we present the benefits of this model for language portability in SLU.
Index Terms— Spoken language understanding, speech representation, language portability, cross modality
1. INTRODUCTION
Spoken language understanding (SLU) refers to natural language processing tasks related to semantic extraction from speech [1]. Various tasks can be addressed as SLU tasks, such as named entity recognition from speech, call routing, or slot filling in the context of human-machine dialogue.
To our knowledge, end-to-end neural approaches were first proposed about four years ago to extract semantics directly from the speech signal with a single neural model [2, 3, 4], instead of applying a classical cascade approach in which an automatic speech recognition (ASR) system is followed by a natural language understanding (NLU) module applied to the automatic transcription [1]. End-to-end approaches have two main advantages. The first is the joint optimization of the ASR and NLU parts, since the single neural model is optimized only for the final SLU task. The second is the mitigation of error propagation: in a cascade approach, errors generated by the first modules propagate to the following ones.
Since 2018, end-to-end approaches have become very popular in the SLU literature [5, 6, 7, 8, 9]. A main issue of these approaches is the lack of bimodal annotated data (speech audio recordings with manual semantic annotation). Several methods have been proposed to address this issue, e.g. transfer learning techniques [10, 11, 12] or artificial augmentation of the training data using speech synthesis [13, 14].
Self-supervised learning (SSL), which benefits from unlabelled data, has recently opened new perspectives for automatic speech recognition and natural language processing [15, 16]. SSL has been successfully applied to several SLU tasks, especially through cascade approaches [17]: the ASR system benefits from learning better speech unit representations [18, 19, 20], while the NLU module benefits from BERT-like models [16]. The use of an end-to-end approach directly exploiting both speech and text SSL models is limited by the difficulty of unifying the speech and textual representation spaces, in addition to the complexity of managing a huge number of model parameters. Some approaches have been proposed to exploit BERT-like capabilities within an end-to-end SLU model, e.g. by projecting sequences of embeddings extracted by an ASR sub-module into a BERT model [21, 22], or by tying the acoustic embeddings at the sentence level to an SLU fine-tuned BERT model for a speech intent detection task [12, 23]. In [24], a similar approach is extended to build a multilingual end-to-end SLU model, again for speech intent detection.
Earlier this year, a promising new model was introduced in [25]. This model combines a state-of-the-art multilingual acoustic frame-level speech representation learning model, XLS-R [26], with the Language Agnostic BERT Sentence Embedding (LaBSE) model [27] to create an utterance-level multimodal multilingual speech encoder. This model is named SAMU-XLSR, for Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework.
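To make the idea concrete, the following is a minimal sketch of the semantic alignment objective behind SAMU-XLSR: frame-level XLS-R features are pooled into a single utterance embedding, projected into the LaBSE space, and pulled towards the frozen LaBSE embedding of the paired transcript via a cosine-distance loss. The module names, dimensions, and pooling choice below are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SamuXlsrHead(nn.Module):
    # Sketch (hypothetical names/dimensions): pools XLS-R frame-level
    # features into one utterance embedding projected into LaBSE space.
    def __init__(self, xlsr_dim=1024, labse_dim=768):
        super().__init__()
        self.attn = nn.Linear(xlsr_dim, 1)        # attention pooling scores
        self.proj = nn.Linear(xlsr_dim, labse_dim)

    def forward(self, frames):
        # frames: (batch, time, xlsr_dim), output of a pretrained XLS-R encoder
        weights = torch.softmax(self.attn(frames), dim=1)
        pooled = (weights * frames).sum(dim=1)    # (batch, xlsr_dim)
        return F.normalize(torch.tanh(self.proj(pooled)), dim=-1)

def alignment_loss(speech_emb, labse_emb):
    # Cosine-distance loss pulling the speech utterance embedding
    # towards the frozen LaBSE embedding of the paired transcript.
    labse_emb = F.normalize(labse_emb, dim=-1)
    return (1.0 - (speech_emb * labse_emb).sum(dim=-1)).mean()

Because the LaBSE text encoder stays frozen during this training, the speech branch learns to place its utterance embedding in LaBSE's language-agnostic semantic space, which is what makes the resulting speech representations semantically aligned across languages.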
In this paper, we analyze the performance and the behavior of the SAMU-XLSR model on the French MEDIA benchmark dataset, which is considered a very challenging benchmark for SLU [28]. Moreover, by using the Italian PortMEDIA corpus [29], we also investigate the potential of this model for language portability.