ON THE USE OF SEMANTICALLY-ALIGNED SPEECH REPRESENTATIONS
FOR SPOKEN LANGUAGE UNDERSTANDING

Gaëlle Laperrière¹, Valentin Pelloin², Mickaël Rouvier¹, Themos Stafylakis³, Yannick Estève¹

¹ LIA - Avignon Université, France
² LIUM - Le Mans Université, France
³ Omilia - Conversational Intelligence, Greece
ABSTRACT
In this paper we examine the use of semantically-aligned
speech representations for end-to-end spoken language under-
standing (SLU). We employ the recently-introduced SAMU-
XLSR model, which is designed to generate a single embed-
ding that captures the semantics at the utterance level, seman-
tically aligned across different languages. This model com-
bines the acoustic frame-level speech representation learning
model (XLS-R) with the Language Agnostic BERT Sentence
Embedding (LaBSE) model. We show that using the SAMU-XLSR model instead of the initial XLS-R model significantly improves performance in the framework of end-to-end SLU. Finally, we present the benefits of this model for language portability in SLU.
Index Terms— Spoken language understanding, speech representation, language portability, cross modality
1. INTRODUCTION
Spoken language understanding (SLU) refers to natural lan-
guage processing tasks related to semantic extraction from
speech [1]. Different tasks can be addressed as SLU tasks, such as named entity recognition from speech, call routing, or slot filling in the context of human-machine dialogue.
To our knowledge, end-to-end neural approaches were first proposed about four years ago in order to extract semantics directly from the speech signal using a single neural model [2, 3, 4], instead of applying a classical cascade approach based on an automatic speech recognition (ASR) system followed by a natural language understanding (NLU) module applied to the automatic transcription [1]. End-to-end approaches have two main advantages. The first is the joint optimization of the ASR and NLU parts, since the single neural model is optimized only for the final SLU task. The second is the mitigation of error propagation: in a cascade approach, errors generated by the first modules propagate to the following ones.
Since 2018, end-to-end approaches have become very popular in the SLU literature [5, 6, 7, 8, 9]. A main issue of these approaches is the lack of bimodal annotated data (speech audio recordings with manual semantic annotation). Several methods have been proposed to address this issue, e.g. transfer learning techniques [10, 11], [12] or artificial augmentation of the training data using speech synthesis [13, 14].
Self-supervised learning (SSL), which benefits from unlabelled data, has recently opened new perspectives for automatic speech recognition and natural language processing [15, 16].
SSL has been successfully applied to several SLU tasks, es-
pecially through cascade approaches [17]: the ASR system
benefits from learning better speech unit representations [18,
19,20] while the NLU module benefits from BERT-like mod-
els [16]. The use of an end-to-end approach directly exploiting both speech and text SSL models is limited by the difficulty of unifying the speech and textual representation spaces, in addition to the complexity of managing a huge number of model parameters. Some approaches have been proposed to exploit BERT-like capabilities within an end-to-end SLU model, e.g. by projecting sequences of embeddings extracted by an ASR sub-module into a BERT model [21, 22], or by tying, at the sentence level, the acoustic embeddings to an SLU fine-tuned BERT model for a speech intent detection task [12, 23]. In [24], a similar approach is extended in order to build a multilingual end-to-end SLU model, again for speech intent detection.
Earlier this year, a promising new model was introduced in [25]. The model combines a state-of-the-art multilingual
acoustic frame-level speech representation learning model
XLS-R [26] with the Language Agnostic BERT Sentence
Embedding [27] (LaBSE) model to create an utterance-
level multimodal multilingual speech encoder. This model
is named SAMU-XLSR, for Semantically-Aligned Multi-
modal Utterance-level Cross-Lingual Speech Representation
learning framework.
In this paper, we analyze the performance and the be-
havior of the SAMU-XLSR model using the French MEDIA
benchmark dataset, which is considered a very challenging benchmark for SLU [28]. Moreover, by using the Italian
PortMEDIA corpus [29], we also investigate the potential of
porting an existing end-to-end SLU model from one language
(French) to another (Italian) through two scenarios concern-
ing the target language: zero-shot or low-resource learning.
2. SAMU-XLSR
Self-supervised representation learning (SSL) approaches
such as Wav2Vec-2.0 [15], HuBERT [20], and WavLM [30]
aim to provide powerful deep speech features (speech embeddings) without requiring large annotated datasets. Speech embeddings are extracted at the acoustic frame level, i.e. for short speech segments of 20 ms duration, and they can be used as input features to a model specific to the downstream task. These speech encoders have been successfully
used in several tasks, such as automatic speech recogni-
tion [15], speaker verification [31,32] and emotion recog-
nition [33, 34]. Self-supervised learning for such speech
encoders is designed to discover speech representations that
encode pseudo-phonetic or phonotactic information rather
than high-level semantic information [35]. On the other hand,
high-level semantic information is particularly useful in some
tasks such as Machine Translation (MT) or Spoken Language
Understanding (SLU). In [25], the authors propose to ad-
dress this issue using a new framework called SAMU-XLSR,
which learns semantically-aligned multimodal utterance-level
cross-lingual speech representations.
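To make the frame-level setting concrete, the short sketch below is our own illustration (not code from [25]): it extracts contextual frame embeddings with the pre-trained XLS-R checkpoint referenced in footnote 1, through the Hugging Face transformers library, using a random placeholder waveform.

# Minimal sketch (assumes torch and transformers are installed); it only
# illustrates that XLS-R outputs one contextual vector per ~20 ms frame.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-xls-r-300m"  # see footnote 1
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint)

waveform = np.random.randn(3 * 16000).astype(np.float32)  # placeholder: 3 s at 16 kHz
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = encoder(**inputs).last_hidden_state  # (batch, n_frames, 1024)

print(frames.shape)  # roughly one 1024-dim embedding every 20 ms of speech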
Fig. 1. Training process of SAMU-XLSR.
SAMU-XLSR is based on the pre-trained multilingual XLS-R model¹ [26], on top of which all the embeddings generated by processing an audio file are fed to an attentive pooling module. Thanks to this pooling mechanism (which is followed by a linear projection layer and a tanh activation), the frame-level contextual representations are transformed into a single utterance-level embedding vector. Figure 1 summarizes the training process of the SAMU-XLSR model. Note that the weights of the pre-trained XLS-R model continue to be updated during this process.
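The following is only a rough sketch of such a pooling head (our illustration, not the authors' code): the exact attention formulation is an assumption, and the 1024-dimensional input and 768-dimensional output match the XLS-R hidden size and the LaBSE embedding size, respectively.

import torch
import torch.nn as nn

class AttentivePoolingHead(nn.Module):
    """Hypothetical pooling head: frame embeddings -> one utterance embedding."""

    def __init__(self, frame_dim: int = 1024, out_dim: int = 768):
        super().__init__()
        self.attention = nn.Linear(frame_dim, 1)        # one relevance score per frame
        self.projection = nn.Linear(frame_dim, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, frame_dim) contextual XLS-R representations
        weights = torch.softmax(self.attention(frames), dim=1)  # (batch, n_frames, 1)
        pooled = (weights * frames).sum(dim=1)                  # (batch, frame_dim)
        return torch.tanh(self.projection(pooled))              # utterance-level embedding

utterance_embedding = AttentivePoolingHead()(torch.randn(2, 149, 1024))
print(utterance_embedding.shape)  # torch.Size([2, 768])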
The utterance-level embedding vector of SAMU-XLSR is trained via knowledge distillation from the pre-trained language-agnostic LaBSE model [27]. The LaBSE model² has been trained on 109 languages, and its text embedding space is semantically aligned across these 109 languages. LaBSE attains state-of-the-art performance on various bi-text retrieval/mining tasks, while yielding promising zero-shot performance for languages not included in the training set (probably thanks to language similarities). Thus, given a spoken utterance, the parameters of SAMU-XLSR are trained to accurately predict the text embedding of its corresponding transcript, as provided by the LaBSE text encoder. Because the LaBSE embedding space is semantically aligned across languages, the text transcript is clustered together with its translations in other languages.
By pulling the speech embedding towards the anchor em-
bedding, cross-lingual speech-text alignments are automati-
cally learned without ever seeing cross-lingual associations
during training. This property is particularly interesting in the SLU context, as it makes it possible to port an existing model built for a well-resourced language to another language with zero or few training resources.
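A minimal sketch of this distillation step is given below; it is our illustration, assuming a cosine-distance objective between the utterance-level speech embedding and the frozen LaBSE transcript embedding (the exact loss used in [25] may differ), with LaBSE loaded from the sentence-transformers checkpoint of footnote 2.

# Hedged sketch of the distillation objective: pull the utterance-level speech
# embedding towards the LaBSE (anchor) embedding of its transcript.
# Assumption: a cosine-distance loss; the exact objective of [25] may differ.
import torch
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")  # see footnote 2
labse.eval()  # the text encoder is frozen; only the speech branch is updated

def distillation_loss(speech_embeddings, transcripts):
    # speech_embeddings: (batch, 768) vectors from XLS-R + the pooling head above
    with torch.no_grad():
        anchors = labse.encode(transcripts, convert_to_tensor=True)  # (batch, 768)
    cosine = torch.nn.functional.cosine_similarity(speech_embeddings, anchors, dim=-1)
    return (1.0 - cosine).mean()  # minimised when speech and text embeddings align

loss = distillation_loss(torch.randn(2, 768), ["oui bonjour", "une chambre double"])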
3. APPLICATION TO SPOKEN LANGUAGE
UNDERSTANDING
As defined in [1], spoken language understanding is the in-
terpretation of signs conveyed by a speech signal. This in-
terpretation refers to a semantic representation manageable
by computers. Usually, this semantic representation is ded-
icated to an application domain that restricts the semantic
field. With the massive deployment of voice assistants like Apple's Siri, Amazon Alexa, and Google Assistant, many recent papers address speech intent detection as an SLU task [12, 13, 14, 24, 23]. In such a task, only one speech intent is generally expected per sentence: speech intent detection can thus be considered a classification task at the sentence level and, in addition, the SLU model has to fill some expected slots corresponding to the detected intent.
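As a purely illustrative example (a hypothetical utterance and label set, not taken from any benchmark discussed here), such a prediction could be represented as:

# Hypothetical intent-detection output: one intent per sentence, plus its slots.
utterance = "book a room in Avignon for two nights"
prediction = {
    "intent": "hotel_booking",
    "slots": {"city": "Avignon", "number_of_nights": "two"},
}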
SLU benchmarks related to task-driven human-machine
spoken dialogue can be more or less complex, depending on
the richness of the semantic representation. In this study, we
¹ https://huggingface.co/facebook/wav2vec2-xls-r-300m
² https://huggingface.co/sentence-transformers/LaBSE