Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Jan Švec, Jan Lehečka, Luboš Šmídl
Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
{honzas,jlehecka,smidl}@kky.zcu.cz
Abstract
In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One very promising approach is the grapheme Wav2Vec 2.0 model, which combines self-supervised pretraining with transfer learning to fine-tune the speech recognizer. Since it requires neither a pronunciation vocabulary nor a language model, the approach is suitable for tasks where obtaining such models is difficult or almost impossible.
In this paper, we use the Wav2Vec speech recognizer for spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space, in which the term occurrences and the assigned scores are easily computed.
The paper describes a bootstrapping approach that transfers the knowledge contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR into the context of the grapheme-based Wav2Vec. The proposed method outperforms the previously published system, based on a combination of a DNN-HMM hybrid ASR and a phoneme recognizer, by a large margin on the MALACH data in both English and Czech.
Index Terms: Spoken Term Detection, Wav2Vec
1. Introduction
Spoken term detection (STD) is a widely studied field of speech processing. STD emerged as a variant of traditional keyword spotting which speeds up the search phase by offline pre-processing and indexing of the searched data [1]; the pre-processing costs are offset by the speed of the online search. A conventional approach to STD uses a DNN-HMM hybrid speech recognizer to transform the input audio into a set of word lattices, from which an inverted word index is built. The drawback of this approach is its inability to index out-of-vocabulary words (OOVs), which must be handled by other methods, such as proxy words [2, 3] or sub-word units [4, 5, 6]. OOVs are especially problematic in the domain of oral history archives, where they often represent the most valuable searched terms, such as personal names and geographical terms [7].
Historically, speech recognizers for oral history archives incorporated a large amount of domain-specific, human-made knowledge into the pronunciation vocabulary and language model [8]. Despite advances in DNN-based acoustic modeling, the hybrid speech recognition approach has reached its limits and has been overtaken by neural end-to-end systems. The Wav2Vec 2.0 speech recognition approach [9] is especially promising for oral history archives: it uses neither a language model nor a vocabulary, which makes it ideal for modeling OOVs. On the other hand, the knowledge accumulated in pronunciation vocabularies and language models is very valuable and could be exploited by a trainable end-to-end STD system.

Figure 1: Schema of training query extraction from unlabeled audio data.
The use of trainable models in the STD task and in the related keyword spotting and query-by-example tasks is widely studied. For example, the approach of [10] maps an input utterance to a sequence of vectors for subsequent STD, while the authors of [11] train a projection from acoustic features to an embedding space shared with the projected embedding of the query. A similar approach was also used in [12].
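To make the shared-embedding idea concrete, the following minimal sketch scores putative hits by cosine similarity between a query embedding and per-segment utterance embeddings. The function names, the plain-list vector representation, and the detection threshold are illustrative assumptions, not details of the cited systems.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def detect(query_emb, segment_embs, threshold=0.5):
    """Return (segment_index, score) pairs whose embedding is close to the query."""
    scores = ((i, cosine(query_emb, e)) for i, e in enumerate(segment_embs))
    return [(i, s) for i, s in scores if s >= threshold]
```

In a trained system, both embeddings would come from learned projection networks; here the point is only that once query and audio live in one space, detection reduces to a similarity threshold.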
In this paper, we propose a method which combines the Wav2Vec and DNN-HMM speech recognizers. First, the input audio is recognized with the grapheme Wav2Vec recognizer into the form of grapheme confusion networks. Since the Wav2Vec recognizer does not use a pronunciation vocabulary, it produces an orthographic transcription of the utterance. Then, the same data are recognized using the traditional DNN-HMM hybrid recognizer; the high-confidence recognized words are used as query terms, and the corresponding segments of the grapheme confusion network serve as samples of particular occurrences of the query term (Fig. 1). The queries and the corresponding training targets (binary values indicating the occurrence of the query in the audio) are used to train the deep LSTM STD neural network [13], which maps each pair into a joint embedding space in which the score of the putative hit is computed. This way, the deep LSTM STD network inherently learns the phonetic and syntactic knowledge incorporated in the DNN-HMM hybrid recognizer.
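The bootstrapping step described above can be sketched as follows. The data layout (word, confidence, confusion-network segment span) and the confidence threshold are hypothetical simplifications of the actual pipeline, shown only to illustrate how positive training queries and per-segment binary targets (the "0 1 0" pattern of Fig. 1) are derived.

```python
def extract_training_queries(hyp_words, conf_threshold=0.9):
    """Keep high-confidence DNN-HMM words as positive training queries.

    hyp_words: list of (word, confidence, (start_seg, end_seg)) triples,
    with the word's time alignment mapped onto grapheme confusion-network
    segment indices.
    """
    return [{"query": word, "span": (start, end), "target": 1}
            for word, conf, (start, end) in hyp_words
            if conf >= conf_threshold]

def per_segment_targets(n_segments, span):
    """Binary target per confusion-network segment: 1 inside the word span."""
    start, end = span
    return [1 if start <= i < end else 0 for i in range(n_segments)]
```

Low-confidence words are simply skipped here; a real training setup would also need negative query/segment pairs, which this sketch omits.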
2. Wav2Vec pretraining & fine-tuning
The Wav2Vec 2.0 framework [9] learns self-supervised representations from raw audio data. It uses a multi-layer convolutional neural network to compute frame-level features, which are then processed by a Transformer network to predict context-dependent representations of the input audio. After pre-training on unlabeled speech, a dense classification layer with softmax activation is added on top of the Transformer and trained in a supervised manner using the CTC loss [14], yielding a grapheme-level speech recognizer.
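To illustrate how a CTC-trained grapheme recognizer turns per-frame outputs into an orthographic transcription, here is a minimal greedy CTC decoder (argmax per frame, collapse repeats, drop blanks). This is a generic textbook sketch, not the paper's decoder, which produces grapheme confusion networks rather than a single best string.

```python
def ctc_greedy_decode(frame_logits, vocab, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)
```

The blank symbol is what lets CTC separate repeated graphemes; collapsing without it would merge genuine double letters.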
arXiv:2210.11885v1 [cs.CL] 21 Oct 2022