Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Jan Švec, Jan Lehečka, Luboš Šmídl
Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
{honzas,jlehecka,smidl}@kky.zcu.cz
Abstract
In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One very promising approach is the grapheme Wav2Vec 2.0 model, which combines self-supervised pretraining with transfer learning to fine-tune the speech recognizer. Since it requires neither a pronunciation vocabulary nor a language model, the approach is suitable for tasks where obtaining such models is difficult or almost impossible.
In this paper, we use the Wav2Vec speech recognizer for spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space, in which the term occurrences and the assigned scores are easily computed.
The paper describes a bootstrapping approach that transfers the knowledge contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR into the context of the grapheme-based Wav2Vec. The proposed method outperforms the previously published system, based on a combination of a DNN-HMM hybrid ASR and a phoneme recognizer, by a large margin on the MALACH data in both English and Czech.
Index Terms: Spoken Term Detection, Wav2Vec
1. Introduction
Spoken term detection (STD) is a widely studied field of speech processing. STD emerged as a variant of traditional keyword spotting which speeds up the search phase by offline pre-processing and indexing of the searched data [1]; the pre-processing costs are offset by the speed of the online search. A conventional approach to STD uses a DNN-HMM hybrid speech recognizer to transform the input audio into a set of word lattices, from which an inverted word index is built. The drawback of this approach is its inability to index out-of-vocabulary words (OOVs), which must be handled by other methods, such as proxy words [2, 3] or sub-word units [4, 5, 6]. OOVs are especially problematic in the domain of oral history archives, where they often represent the most valuable searched terms, such as personal names and geographical terms [7].
Historically, speech recognizers for oral history archives incorporated a large amount of domain-specific, human-made knowledge into the pronunciation vocabulary and language model [8]. Despite advances in DNN-based acoustic modeling, the hybrid speech recognition approach has reached its limits and has been overtaken by neural end-to-end systems. The Wav2Vec 2.0 speech recognition approach [9] is especially promising for oral history archives: it uses neither a language model nor a vocabulary, which makes it ideal for modeling OOVs. On the other hand, the knowledge accumulated in pronunciation vocabularies and language models is very valuable and could be exploited by a trainable end-to-end STD system.

Figure 1: Schema of training query extraction from unlabeled audio data.
The use of trainable models in the STD task and in the related keyword spotting and query-by-example tasks is widely studied. For example, the approach of [10] maps an input utterance to a sequence of vectors for subsequent STD, while the authors of [11] train a projection from acoustic features to an embedding space shared with the projected embedding of the query. A similar approach was also used in [12].
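To make the shared-embedding idea concrete, the following minimal sketch scores putative hits by cosine similarity between a query embedding and per-segment utterance embeddings. The function names, the plain-list vector representation, and the detection threshold are illustrative assumptions, not details of the cited systems.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def detect(query_emb, segment_embs, threshold=0.5):
    """Return (segment_index, score) pairs whose embedding is close to the query."""
    scores = ((i, cosine(query_emb, e)) for i, e in enumerate(segment_embs))
    return [(i, s) for i, s in scores if s >= threshold]
```

In a trained system, both embeddings would come from learned projection networks; here the point is only that once query and audio live in one space, detection reduces to a similarity threshold.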
In this paper, we propose a method which combines the Wav2Vec and DNN-HMM speech recognizers. First, the input audio is recognized with the grapheme Wav2Vec recognizer into the form of grapheme confusion networks. Since the Wav2Vec recognizer does not use a pronunciation vocabulary, it produces an orthographic transcription of the utterance. Then, the same data are recognized using the traditional DNN-HMM hybrid recognizer; the high-confidence recognized words are used as query terms, and the corresponding segments of the grapheme confusion network serve as samples of particular occurrences of the query term (Fig. 1). The queries and the corresponding training targets (binary values indicating the occurrence of the query in the audio) are used to train the deep LSTM STD neural network [13], which maps each pair into a joint embedding space in which the score of the putative hit is computed. This way, the deep LSTM STD network inherently learns the phonetic and syntactic knowledge incorporated in the DNN-HMM hybrid recognizer.
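The bootstrapping step described above can be sketched as follows. The data layout (word, confidence, confusion-network segment span) and the confidence threshold are hypothetical simplifications of the actual pipeline, shown only to illustrate how positive training queries and per-segment binary targets (the "0 1 0" pattern of Fig. 1) are derived.

```python
def extract_training_queries(hyp_words, conf_threshold=0.9):
    """Keep high-confidence DNN-HMM words as positive training queries.

    hyp_words: list of (word, confidence, (start_seg, end_seg)) triples,
    with the word's time alignment mapped onto grapheme confusion-network
    segment indices.
    """
    return [{"query": word, "span": (start, end), "target": 1}
            for word, conf, (start, end) in hyp_words
            if conf >= conf_threshold]

def per_segment_targets(n_segments, span):
    """Binary target per confusion-network segment: 1 inside the word span."""
    start, end = span
    return [1 if start <= i < end else 0 for i in range(n_segments)]
```

Low-confidence words are simply skipped here; a real training setup would also need negative query/segment pairs, which this sketch omits.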
2. Wav2Vec pretraining & fine-tuning
The Wav2Vec 2.0 framework [9] learns self-supervised representations from raw audio data. It uses a multi-layer convolutional neural network to compute frame-level features, which are then processed by a Transformer network to predict context-dependent representations of the input audio. After pre-training on unlabeled speech, a dense classification layer with softmax activation is added on top of the Transformer and trained in a supervised manner using the CTC loss [14], yielding a grapheme-level speech recognizer.
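To illustrate how a CTC-trained grapheme recognizer turns per-frame outputs into an orthographic transcription, here is a minimal greedy CTC decoder (argmax per frame, collapse repeats, drop blanks). This is a generic textbook sketch, not the paper's decoder, which produces grapheme confusion networks rather than a single best string.

```python
def ctc_greedy_decode(frame_logits, vocab, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)
```

The blank symbol is what lets CTC separate repeated graphemes; collapsing without it would merge genuine double letters.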
arXiv:2210.11885v1 [cs.CL] 21 Oct 2022