DYNAMIC SPEECH ENDPOINT DETECTION WITH REGRESSION TARGETS
Dawei Liang⋆, Hang Su†, Tarun Singh†, Jay Mahadeokar†, Shanil Puri†, Jiedan Zhu†,
Edison Thomaz⋆, Mike Seltzer†
⋆University of Texas at Austin, †Meta AI
ABSTRACT
Interactive voice assistants have been widely used as input
interfaces in various scenarios, e.g. on smart home devices,
wearables, and AR devices. Detecting the end of a speech
query, i.e. speech end-pointing, is an important task for voice
assistants to interact with users. Traditionally, speech end-pointing
is based on pure classification methods with arbitrary
binary targets. In this paper, we propose a novel
regression-based speech end-pointing model, which enables
an end-pointer to adjust its detection behavior based on the
context of user queries. Specifically, we present a pause
modeling method and show its effectiveness for dynamic
end-pointing. In experiments on vendor-collected smartphone
and wearables speech queries, our strategy shows a better
trade-off between endpointing latency and accuracy than the
traditional classification-based method. We further discuss
the benefits of this model and the generalization of the
framework.
Index Terms—endpointing, end-of-query, interactive
voice assistant.
1. INTRODUCTION
With the rapid development of speech technologies in recent
years, interactive voice assistants have been widely adopted
as a mainstream intelligent user interface [1, 2, 3, 4]. By
taking users’ speech queries, these systems are able to per-
form a variety of tasks from basic question answering, music
playing, calling and messaging, to device control. As an ini-
tial step, a voice assistant needs to determine the time point
when a user finishes the query so that it knows when to close
the microphone and continue downstream processing (e.g.
language understanding and taking action). This process is
often referred to as speech endpoint detection or end-pointing.
In practice, the challenge of speech end-pointing lies in the
conflicting goals of fast response and endpoint accuracy
(avoiding early cuts of user speech). Specifically, rapidly
closing the microphone and responding to the user query
improves the user experience when the endpoint is correct,
but it inevitably increases the risk of cutting user queries off
early. The performance of an end-pointing system is thus
evaluated on how well this conflict is resolved in practice.
Canonically, voice activity detection (VAD) has been
widely used for speech end-pointing [5, 6]. In the VAD set-
ting, a model is developed to distinguish speech segments
from non-speech segments, which usually includes silence,
music, and background noises [7, 8, 9]. The end of the query
can then be determined if a fixed duration of silence is ob-
served by the VAD system. However, this approach is not
reliable enough as pointed out by later work [10, 11]. The
fact that a VAD system is typically not trained to distinguish
pauses within and at the end of queries prevents the system
from capturing enough acoustic cues related to end-of-query
detection, such as speaking rhythm or filler sounds [11].
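As a minimal illustration, the fixed-silence-duration rule described above can be sketched as follows. The frame rate, threshold value, and function names here are hypothetical choices for exposition, not taken from the cited works:

```python
def vad_endpoint(vad_is_speech, frame_ms=10, silence_ms=500):
    """Return the frame index at which a fixed-silence end-pointer
    would close the microphone, or None if it never triggers.

    vad_is_speech: per-frame booleans from a VAD (True = speech).
    silence_ms: trailing silence required to declare end-of-query.
    """
    needed = silence_ms // frame_ms  # consecutive non-speech frames required
    run = 0
    saw_speech = False
    for i, is_speech in enumerate(vad_is_speech):
        if is_speech:
            saw_speech = True
            run = 0  # any speech resets the silence counter
        elif saw_speech:
            run += 1
            if run >= needed:
                return i  # endpoint declared at this frame
    return None

# A mid-query pause shorter than the threshold (200 ms here) does not
# trigger an endpoint, but any pause >= silence_ms does -- even if the
# user is merely hesitating, which is exactly the weakness noted above.
frames = [True] * 30 + [False] * 20 + [True] * 10 + [False] * 60
print(vad_endpoint(frames, frame_ms=10, silence_ms=500))
```

Because the silence threshold is fixed, the only way to tolerate long hesitations is to raise it globally, which delays every response.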
Recent research on speech end-pointing has mostly focused
on classification methods, where a dedicated end-pointer is
trained to classify audio frames into end-of-query frames
and other frames [11, 12]. The success of the Long Short-Term
Memory (LSTM) [13] architecture has contributed to this
approach. Other efforts incorporate additional text decoding
features [14] or personalized user i-vectors [15] to better
adapt the end-pointer to specific acoustic environments. In an
end-to-end automatic speech recognition (ASR) system, the
end-pointer may also be jointly optimized with the recognition
model [16]. Despite these promising results, all of the above
works focus on binary detection of end-of-query with hard
labels (e.g. 0 and 1). In real scenarios, however, an end-pointer
should adjust its endpointing aggressiveness based on semantic,
prosodic, or other speaking patterns in the query. Traditional
binary classification targets are less flexible in this respect.
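To make the contrast concrete, the sketch below compares the two target schemes for a query whose speech ends at a given frame. The function names, shapes, and the linear ramp are our own illustrative assumptions, not the scheme of the cited works or of this paper:

```python
def hard_targets(num_frames, end_frame):
    """Binary classification targets: 1 for every frame at or after
    end-of-query, 0 before it -- the same label regardless of context."""
    return [1.0 if i >= end_frame else 0.0 for i in range(num_frames)]

def soft_targets(num_frames, end_frame, expected_pause_frames):
    """Regression-style soft targets that ramp up over an expected
    pause length, so a context where a short pause is expected can
    endpoint aggressively while a context where a longer pause is
    plausible (e.g. mid-sentence) delays the decision."""
    targets = []
    for i in range(num_frames):
        if i < end_frame:
            targets.append(0.0)
        else:
            # linear ramp reaching 1.0 after expected_pause_frames frames
            targets.append(min(1.0, (i - end_frame + 1) / expected_pause_frames))
    return targets

# Hard labels jump straight from 0 to 1 at the end frame; soft targets
# rise gradually, and the slope can vary with the query's context.
print(hard_targets(8, end_frame=5))
print(soft_targets(8, end_frame=5, expected_pause_frames=3))
```

With hard labels, every query implicitly shares one endpointing aggressiveness; with soft targets, the ramp length becomes a per-query knob that a regression loss can learn to predict.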
In this paper, we study a novel speech end-pointing strategy
based on a regression method. In this setup, an end-pointer
is optimized to fit soft-coded targets during training, where
the targets are set by considering the expected pause given
the semantic context of the query. Testing on both 14.4M
smartphone speech queries and 467K wearables user queries,
we show that our proposed method effectively reduces the
response delay of end-pointing while maintaining accuracy
comparable to the conventional classification-based method.
arXiv:2210.14252v1 [cs.SD] 25 Oct 2022