
Spoken Term Detection and Relevance Score Estimation
using Dot-Product of Pronunciation Embeddings
Jan Švec1, Luboš Šmídl1, Josef V. Psutka1, Aleš Pražák1
1Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
{honzas,smidl,psutka_j,aprazak}@kky.zcu.cz
Abstract
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by a deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in the English and Czech languages.
Index Terms: spoken term detection, relevance-score estimation, speech embeddings
1. Introduction
The task of spoken term detection (STD) in large spoken archives typically employs large vocabulary continuous speech recognition (LVCSR). By recognizing and pre-indexing the spoken utterances, the in-vocabulary (IV) queries can be found directly in the word index. The handling of out-of-vocabulary (OOV) terms relies on a much wider spectrum of methods [1], including recognition and indexing of sub-word units (phonemes, syllables or word fragments) [2, 3, 4], the use of IV proxy words [5, 6], or the use of acoustic embeddings and similarity metrics in a vector space [7, 8]. Acoustic embeddings also often play a role in the query-by-example (QbE) task in low-resource setups, but the idea of a neural-network-based projection of the query and the utterance into a single space can be reused in the more general STD task employing standard speech recognition models [9, 10, 11].
In the QbE task, recurrent neural networks (RNNs) are usually used in a Siamese configuration – two similar networks handle the utterance and the query, respectively. The networks are often trained using the triplet loss function [12, 13, 8]. The use of RNNs to process the signal and the query is also present in the wake-up word detection task [14, 15, 16].
Since we target large spoken archives for which an LVCSR system already exists and is used for searching the IV terms, we focus on methods where STD for OOV terms is performed using a phoneme recognizer (with a structure similar to the LVCSR). The idea is not new: we used it in the mostly heuristic search presented in [2] and subsequently adopted the approach of Siamese networks [17] to robustly estimate the term relevance scores. In this work, we modify the Siamese architecture presented in [17] with the goal of simplifying the network architecture and further improving the STD performance:
STD process – while the original Siamese architecture was proposed for relevance score estimation only and the localization of terms was performed using an index of phoneme triplets, the proposed approach both localizes and scores the putative hits of the searched term. The proposed method does not need any kind of DTW [18, 19] or subsequence matching [10].
Network structure – we keep the dual structure of the network, where we map the recognized sequence and the searched term into an embedding space using recurrent neural networks with the same architecture [13, 7].
Loss function – the loss function based on normalized cosine similarity [18, 13] was replaced with a simple binary cross-entropy, which allows the network outputs to be interpreted as probabilities of occurrence and improves the calibration of the relevance scores [20].
Network training – the idea of self-supervised learning from "blindly" recognized hypotheses on word and phoneme levels was used in a similar way. This way, a large amount of training data can be easily generated from large spoken archives [11, 17]. Moreover, such training data exactly match the speech recognizer used, and the neural network can model and partially compensate for the errors of the recognizer.
Pronunciation embeddings – both the recognition output in the form of a phoneme confusion network (sausage) and the graphemic representation of the searched term are projected into the same embedding space [7, 11, 18, 8]. In this space, the probability of occurrence is computed as a dot-product of two vectors and calibrated using a simple sigmoid function.
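To make the scoring step concrete, the following minimal sketch (Python with NumPy; the variable names, shapes and labels are illustrative assumptions, not taken from the paper) shows how per-segment probabilities of occurrence can be obtained from dot-products between segment embeddings and a query embedding, calibrated by a sigmoid, and how a binary cross-entropy loss over such probabilities would be computed:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative shapes: T confusion-network segments, embedding dimension D.
T, D = 50, 128
rng = np.random.default_rng(0)

# In the proposed method these embeddings would be produced by the two
# deep LSTM encoders; here they are random placeholders.
segment_emb = rng.normal(size=(T, D))   # one vector per confusion-network segment
query_emb = rng.normal(size=(D,))       # one vector for the searched term

# Dot-product scores calibrated by a sigmoid into per-segment probabilities
# that the searched term occurs at the given segment.
probs = sigmoid(segment_emb @ query_emb)        # shape (T,)

# Binary cross-entropy against hypothetical per-segment occurrence labels.
labels = np.zeros(T)
labels[20:24] = 1.0                             # assume the term spans segments 20-23
eps = 1e-7
bce = -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))

# A putative hit can then be obtained by thresholding the probability sequence.
hit_segments = np.where(probs > 0.5)[0]

In the proposed method, the location of a searched term is estimated from such a sequence of output probabilities, as described in the following sections.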
The above-mentioned ideas, such as the unsupervised transformation of the utterance into a latent embedding space using the wav2vec method [21], the mapping of words into a phonetic embedding space [10, 9], or the calibration of relevance scores [22], are broadly used. This paper presents a novel application of such methods to the STD task in an integrated, seamless way.
2. Deep LSTM for Spoken Term Detection
The proposed network architecture reuses some ideas from [17], especially the mapping from the graphemic representation of a searched term and from recognized phonemes into a shared embedding space in which the relevance score is easily computed. The key difference from the previous work is that the sequence of recognized phonemes is mapped to a sequence of embedded vectors of the same length, whereas the Siamese network used a mapping to a single vector. This way, the scores between the searched term and the phoneme sequence are computed on a per-phoneme basis (or per confusion network segment if using