ON THE USE OF SEMANTICALLY-ALIGNED SPEECH REPRESENTATIONS
FOR SPOKEN LANGUAGE UNDERSTANDING
Gaëlle Laperrière1, Valentin Pelloin2, Mickaël Rouvier1, Themos Stafylakis3, Yannick Estève1
1LIA - Avignon Université, France
2LIUM - Le Mans Université, France
3Omilia - Conversational Intelligence, Greece
ABSTRACT
In this paper we examine the use of semantically-aligned speech representations for end-to-end spoken language understanding (SLU). We employ the recently-introduced SAMU-XLSR model, which is designed to generate a single embedding that captures the semantics at the utterance level and is semantically aligned across different languages. This model combines the acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model. We show that using the SAMU-XLSR model instead of the initial XLS-R model significantly improves performance in the framework of end-to-end SLU. Finally, we present the benefits of this model for language portability in SLU.
Index Terms— Spoken language understanding, speech representation, language portability, cross modality
1. INTRODUCTION
Spoken language understanding (SLU) refers to natural language processing tasks related to semantic extraction from speech [1]. Various tasks can be addressed as SLU tasks, such as named entity recognition from speech, call routing, or slot filling in the context of human-machine dialogue.
To our knowledge, end-to-end neural approaches were first proposed about four years ago to extract semantics directly from the speech signal with a single neural model [2, 3, 4], instead of applying a classical cascade approach in which an automatic speech recognition (ASR) system is followed by a natural language understanding (NLU) module applied to the automatic transcription [1]. End-to-end approaches have two main advantages. The first is the joint optimization of the ASR and NLU parts, since the single neural model is optimized only for the final SLU task. The second is the mitigation of error propagation: in a cascade approach, errors generated by the first modules propagate to the following ones.
Since 2018, end-to-end approaches have become very popular in the SLU literature [5, 6, 7, 8, 9]. A main issue of these approaches is the lack of bimodal annotated data (speech audio recordings with manual semantic annotation). Several methods have been proposed to address this issue, e.g. transfer learning techniques [10, 11, 12] or artificial augmentation of the training data using speech synthesis [13, 14].
Self-supervised learning (SSL), which benefits from unlabelled data, has recently opened new perspectives for automatic speech recognition and natural language processing [15, 16]. SSL has been successfully applied to several SLU tasks, especially through cascade approaches [17]: the ASR system benefits from learning better speech unit representations [18, 19, 20], while the NLU module benefits from BERT-like models [16]. The use of an end-to-end approach directly exploiting both speech and text SSL models is limited by the difficulty of unifying the speech and textual representation spaces, in addition to the complexity of managing a huge number of model parameters. Some approaches have been proposed to exploit BERT-like capabilities within an end-to-end SLU model, e.g. by projecting sequences of embeddings extracted by an ASR sub-module into a BERT model [21, 22], or by tying the acoustic embeddings at the sentence level to an SLU fine-tuned BERT model for a speech intent detection task [12, 23]. In [24], a similar approach is extended to build a multilingual end-to-end SLU model, again for speech intent detection.
Earlier this year, a promising new model was introduced in [25]. This model combines a state-of-the-art multilingual acoustic frame-level speech representation learning model, XLS-R [26], with the Language Agnostic BERT Sentence Embedding (LaBSE) model [27] to create an utterance-level multimodal multilingual speech encoder. This model is named SAMU-XLSR, for Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework.
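To make the idea concrete, the following is a minimal sketch of the semantic alignment objective behind SAMU-XLSR: frame-level XLS-R features are pooled into a single utterance embedding, projected into the LaBSE space, and pulled towards the frozen LaBSE embedding of the paired transcript via a cosine-distance loss. The module names, dimensions, and pooling choice below are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SamuXlsrHead(nn.Module):
    # Sketch (hypothetical names/dimensions): pools XLS-R frame-level
    # features into one utterance embedding projected into LaBSE space.
    def __init__(self, xlsr_dim=1024, labse_dim=768):
        super().__init__()
        self.attn = nn.Linear(xlsr_dim, 1)        # attention pooling scores
        self.proj = nn.Linear(xlsr_dim, labse_dim)

    def forward(self, frames):
        # frames: (batch, time, xlsr_dim), output of a pretrained XLS-R encoder
        weights = torch.softmax(self.attn(frames), dim=1)
        pooled = (weights * frames).sum(dim=1)    # (batch, xlsr_dim)
        return F.normalize(torch.tanh(self.proj(pooled)), dim=-1)

def alignment_loss(speech_emb, labse_emb):
    # Cosine-distance loss pulling the speech utterance embedding
    # towards the frozen LaBSE embedding of the paired transcript.
    labse_emb = F.normalize(labse_emb, dim=-1)
    return (1.0 - (speech_emb * labse_emb).sum(dim=-1)).mean()

Because the LaBSE text encoder stays frozen during this training, the speech branch learns to place its utterance embedding in LaBSE's language-agnostic semantic space, which is what makes the resulting speech representations semantically aligned across languages.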
In this paper, we analyze the performance and the behavior of the SAMU-XLSR model on the French MEDIA benchmark dataset, which is considered a very challenging benchmark for SLU [28]. Moreover, by using the Italian PortMEDIA corpus [29], we also investigate the potential of this model for language portability.