DYNAMIC SPEECH ENDPOINT DETECTION WITH REGRESSION TARGETS
Dawei Liang*, Hang Su†, Tarun Singh†, Jay Mahadeokar†, Shanil Puri†, Jiedan Zhu†,
Edison Thomaz*, Mike Seltzer†
*University of Texas at Austin, †Meta AI.
ABSTRACT
Interactive voice assistants are widely used as input interfaces
in various scenarios, e.g. on smart home devices, wearables, and
AR devices. Detecting the end of a speech query, i.e. speech
end-pointing, is an important task for voice assistants
interacting with users. Traditionally, speech end-pointing is
based on pure classification methods with arbitrary binary
targets. In this paper, we propose a novel regression-based
speech end-pointing model, which enables an end-pointer to adjust
its detection behavior based on the context of user queries.
Specifically, we present a pause modeling method and show its
effectiveness for dynamic end-pointing. In our experiments with
vendor-collected smartphone and wearables speech queries, our
strategy shows a better trade-off between end-pointing latency
and accuracy than the traditional classification-based method. We
further discuss the benefits of this model and the generalization
of the framework in the paper.
Index Terms: endpointing, end-of-query, interactive voice
assistant.
1. INTRODUCTION
With the rapid development of speech technologies in recent
years, interactive voice assistants have been widely adopted as a
mainstream intelligent user interface [1, 2, 3, 4]. By taking
users’ speech queries, these systems can perform a variety of
tasks, from basic question answering, music playing, calling, and
messaging to device control. As an initial step, a voice
assistant needs to determine the point in time at which a user
finishes the query, so that it knows when to close the microphone
and continue downstream processing (e.g. language understanding
and taking action). This process is often referred to as speech
endpoint detection or end-pointing. In practice, the challenge
for speech end-pointing lies in the conflicting goals of fast
response and endpoint accuracy (avoiding early cuts of user
speech). Specifically, closing the microphone and responding
quickly gives a better user experience if the query is
end-pointed correctly, but it inevitably increases the risk of
cutting the query off early. The performance of an end-pointing
system is thus evaluated on how well this conflict is resolved in
practical cases.
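
The two competing quantities can be made concrete with a small sketch. It assumes each query has a reference end-of-speech time and a time at which the end-pointer fired, both in milliseconds. The function name `endpoint_tradeoff` and the choice of the median as the latency summary are illustrative assumptions, not the metrics defined in this paper.

```python
from statistics import median


def endpoint_tradeoff(true_ends_ms, predicted_ends_ms):
    """Return (early_cut_rate, median_latency_ms) over paired queries.

    true_ends_ms:      when the user actually stopped speaking.
    predicted_ends_ms: when the end-pointer closed the microphone.
    Both names and the median summary are assumptions for illustration.
    """
    if len(true_ends_ms) != len(predicted_ends_ms) or not true_ends_ms:
        raise ValueError("need two non-empty sequences of equal length")

    # Positive delay: the system waited after the user finished.
    # Negative delay: the system cut the user off (an early cut).
    delays = [p - t for t, p in zip(true_ends_ms, predicted_ends_ms)]

    early_cuts = sum(1 for d in delays if d < 0)
    waited = [d for d in delays if d >= 0]

    # Latency is summarized only over queries that were not cut early.
    latency = median(waited) if waited else float("nan")
    return early_cuts / len(delays), latency


rate, latency = endpoint_tradeoff(
    [2000, 3000, 1500, 4000],
    [2300, 2800, 2000, 4400],
)
print(rate, latency)
```

A more aggressive end-pointer lowers the latency figure while raising the early-cut rate, which is the conflict described above.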
Canonically, voice activity detection (VAD) has been widely used
for speech end-pointing [5, 6]. In the VAD setting, a model is
developed to distinguish speech segments from non-speech
segments, which usually include silence, music, and background
noise [7, 8, 9]. The end of the query is then declared once the
VAD system observes a fixed duration of silence. However, this
approach is not reliable enough, as pointed out by later work
[10, 11]. Because a VAD system is typically not trained to
distinguish pauses within a query from pauses at its end, it
cannot capture enough acoustic cues related to end-of-query
detection, such as speaking rhythm or filler sounds [11].
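
A minimal sketch of the fixed-silence rule follows, assuming the VAD already yields one speech/non-speech decision per frame. The function name, the 10 ms frame length, and the 500 ms default threshold are hypothetical values chosen for illustration and are not taken from the cited systems.

```python
def fixed_silence_endpoint(is_speech, frame_ms=10, silence_ms=500):
    """Return the time in ms at which the microphone would close, or None.

    is_speech: per-frame booleans from some VAD (True means speech).
    The endpoint fires once `silence_ms` of consecutive non-speech has
    been observed after at least one speech frame.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division
    seen_speech = False
    run = 0  # length of the current non-speech run, in frames

    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            run = 0
        elif seen_speech:  # leading silence before the query is ignored
            run += 1
            if run >= needed:
                return (i + 1) * frame_ms
    return None


# A query with a 100 ms pause in the middle, then 600 ms of trailing silence.
frames = [False] * 5 + [True] * 20 + [False] * 10 + [True] * 10 + [False] * 60

print(fixed_silence_endpoint(frames, silence_ms=500))  # waits for the true end
print(fixed_silence_endpoint(frames, silence_ms=100))  # fires at the mid-query pause
```

With the shorter threshold the rule fires at the within-query pause and cuts the user off, because the rule sees only silence duration and cannot tell the two kinds of pause apart.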
Recent research on speech end-pointing mostly focuses on
classification methods, where a dedicated end-pointer is
developed to classify audio frames into end-of-query frames and
other frames [11, 12]. The success of the Long Short-Term Memory
(LSTM) [13] architecture has contributed to this method. Other
efforts add text decoding features [14] or user-personalized
i-vectors [15] to better fit the end-pointer to specific acoustic
environments. In an end-to-end automatic speech recognition (ASR)
system, the end-pointer may also be jointly optimized with the
recognition model [16]. Despite the promising results, all of the
above works focus on binary detection of end-of-query with hard
labels (e.g. 0 and 1). In real scenarios, however, an end-pointer
should adjust its end-pointing aggressiveness based on semantic,
prosodic, or other speaking patterns in the query. The
traditional binary targets for classification can be less
flexible in this respect.
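
The hard-label setup can be sketched as follows. The labeling convention (0 before the end of the query, 1 from it onward) and the threshold-crossing decision rule are assumptions made for illustration; the cited classifiers may define their targets and decisions differently.

```python
def hard_labels(num_frames, end_frame):
    """Binary per-frame targets: 0 before the end of the query, 1 from it on."""
    return [0 if i < end_frame else 1 for i in range(num_frames)]


def first_crossing(scores, threshold=0.5):
    """Index of the first frame whose end-of-query score reaches `threshold`.

    `scores` stands in for the per-frame outputs of a trained classifier.
    Returns None if the threshold is never reached.
    """
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


print(hard_labels(6, 4))

# Hypothetical classifier outputs for eight frames.
scores = [0.05, 0.10, 0.30, 0.45, 0.60, 0.80, 0.90, 0.95]
print(first_crossing(scores, threshold=0.5))
print(first_crossing(scores, threshold=0.9))
```

In this sketch the only control over aggressiveness is a single global threshold applied to every query, which illustrates why fixed binary targets leave little room to adapt to the content of an individual query.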
In this paper, we study a novel speech end-pointing strategy
based on regression. In this setup, an end-pointer is optimized
to fit soft-coded targets during training, and the training
targets are set by considering the expected pause given the
semantic context of the queries. By testing on both 14.4M
smartphone speech queries and 467K wearables user queries, we
show that our proposed method effectively reduces the response
delay of end-pointing while maintaining accuracy comparable to
the conventional classification-based method.
arXiv:2210.14252v1 [cs.SD] 25 Oct 2022