
Spoken Term Detection and Relevance Score Estimation
using Dot-Product of Pronunciation Embeddings
Jan Švec1, Luboš Šmídl1, Josef V. Psutka1, Aleš Pražák1
1Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
{honzas,smidl,psutka_j,aprazak}@kky.zcu.cz
Abstract
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by a deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in the English and Czech languages.
Index Terms: spoken term detection, relevance-score estimation, speech embeddings
1. Introduction
The task of spoken term detection (STD) in large spoken archives typically employs large vocabulary continuous speech recognition (LVCSR). By recognizing and pre-indexing the spoken utterances, the in-vocabulary (IV) queries can be found directly in the word index. The handling of out-of-vocabulary (OOV) terms relies on a much wider spectrum of methods [1], including recognition and indexing of sub-word units (phonemes, syllables or word fragments) [2, 3, 4], the use of IV proxy words [5, 6], or the use of acoustic embeddings and similarity metrics in a vector space [7, 8]. Acoustic embeddings also often play a role in the query-by-example (QbE) task in low-resource setups, but the idea of a neural-network-based projection of the query and the utterance into a single space can be reused in the more general STD task employing standard speech recognition models [9, 10, 11].
In the QbE task, recurrent neural networks (RNNs) are usually used in a Siamese configuration – two similar networks handle the utterance and the query, respectively. The networks are often trained using the triplet loss function [12, 13, 8]. The use of RNNs to process the signal and the query is also present in the wake-up word detection task [14, 15, 16].
Since we target large spoken archives for which an LVCSR system already exists and is used for searching the IV terms, we focus on methods where STD for OOV terms is performed using a phoneme recognizer (with a structure similar to the LVCSR). The idea is not new: we used it in the mostly heuristic search presented in [2] and subsequently adopted the approach of Siamese networks [17] to robustly estimate the term relevance scores. In this work, we modify the Siamese architecture presented in [17] with the goal of simplifying the network architecture and further improving the STD performance:
STD process – while the original Siamese architecture was proposed for relevance score estimation only and the localization of terms was performed using an index of phoneme triplets, the proposed approach both localizes and scores the putative hits of the searched term. The proposed method does not need any kind of DTW [18, 19] or subsequence matching [10].
Network structure – we keep the dual structure of the network, where we map the recognized sequence and the searched term into an embedding space using recurrent neural networks with the same architecture [13, 7].
Loss function – the loss function based on normalized cosine similarity [18, 13] was replaced with a simple binary cross-entropy, which allows the network outputs to be interpreted as probabilities of occurrence and improves the calibration of the relevance scores [20].
Network training – the idea of self-supervised learning from "blindly" recognized hypotheses on word and phoneme levels was used in a similar way. This way, a large amount of training data can be easily generated from large spoken archives [11, 17]. Moreover, such training data exactly match the speech recognizer used, and the neural network can model and partially compensate for the errors of the recognizer.
Pronunciation embeddings – both the recognition output in the form of a phoneme confusion network (sausage) and the graphemic representation of the searched term are projected into the same embedding space [7, 11, 18, 8]. In this space, the probability of occurrence is computed as a dot-product of two vectors and calibrated using a simple sigmoid function.
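To make the scoring step concrete, the following minimal sketch (Python with NumPy; the variable names, shapes and labels are illustrative assumptions, not taken from the paper) shows how per-segment probabilities of occurrence can be obtained from dot-products between segment embeddings and a query embedding, calibrated by a sigmoid, and how a binary cross-entropy loss over such probabilities would be computed:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative shapes: T confusion-network segments, embedding dimension D.
T, D = 50, 128
rng = np.random.default_rng(0)

# In the proposed method these embeddings would be produced by the two
# deep LSTM encoders; here they are random placeholders.
segment_emb = rng.normal(size=(T, D))   # one vector per confusion-network segment
query_emb = rng.normal(size=(D,))       # one vector for the searched term

# Dot-product scores calibrated by a sigmoid into per-segment probabilities
# that the searched term occurs at the given segment.
probs = sigmoid(segment_emb @ query_emb)        # shape (T,)

# Binary cross-entropy against hypothetical per-segment occurrence labels.
labels = np.zeros(T)
labels[20:24] = 1.0                             # assume the term spans segments 20-23
eps = 1e-7
bce = -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))

# A putative hit can then be obtained by thresholding the probability sequence.
hit_segments = np.where(probs > 0.5)[0]

In the proposed method, the location of a searched term is estimated from such a sequence of output probabilities, as described in the following sections.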
The above-mentioned ideas, such as the unsupervised transformation of the utterance into a latent embedding space using the wav2vec method [21], the mapping of words into a phonetic embedding space [10, 9], or the calibration of relevance scores [22], are broadly used. This paper presents a novel application of such methods to the STD task in an integrated, seamless way.
2. Deep LSTM for Spoken Term Detection
The proposed network architecture reuses some ideas from [17], especially the mapping from the graphemic representation of a searched term and from recognized phonemes into a shared embedding space in which the relevance score is easily computed. The key difference from the previous work is that the sequence of recognized phonemes is mapped to a sequence of embedded vectors of the same length, whereas the Siamese network used a mapping to a single vector. This way, the scores between the searched term and the phoneme sequence are computed on a per-phoneme basis (or per confusion network segment if using