DYNAMIC SPEECH ENDPOINT DETECTION WITH REGRESSION TARGETS
Dawei Liang⋆, Hang Su†, Tarun Singh†, Jay Mahadeokar†, Shanil Puri†, Jiedan Zhu†,
Edison Thomaz⋆, Mike Seltzer†
⋆University of Texas at Austin, †Meta AI
ABSTRACT
Interactive voice assistants have been widely used as input
interfaces in various scenarios, e.g. on smart home devices,
wearables, and AR devices. Detecting the end of a speech
query, i.e. speech end-pointing, is an important task for voice
assistants to interact with users. Traditionally, speech end-pointing
is based on pure classification methods with arbitrary
binary targets. In this paper, we propose a novel
regression-based speech end-pointing model, which enables
an end-pointer to adjust its detection behavior based on the
context of user queries. Specifically, we present a pause
modeling method and show its effectiveness for dynamic
end-pointing. In experiments on vendor-collected smartphone
and wearables speech queries, our strategy shows a better
trade-off between endpointing latency and accuracy than the
traditional classification-based method. We further discuss
the benefits of this model and the generalization of the
framework.
Index Terms—endpointing, end-of-query, interactive
voice assistant.
1. INTRODUCTION
With the rapid development of speech technologies in recent
years, interactive voice assistants have been widely adopted
as a mainstream intelligent user interface [1, 2, 3, 4]. By
taking users’ speech queries, these systems are able to per-
form a variety of tasks from basic question answering, music
playing, calling and messaging, to device control. As an ini-
tial step, a voice assistant needs to determine the time point
when a user finishes the query so that it knows when to close
the microphone and continue downstream processing (e.g.
language understanding and taking action). This process is
often referred to as speech endpoint detection or end-pointing.
In practice, the challenge of speech end-pointing lies in the
conflicting goals of fast response and endpoint accuracy
(avoiding early cuts of user speech). Specifically, rapidly
closing the microphone and responding to the user query
improves the user experience when the endpoint is correct,
but it inevitably increases the risk of cutting user queries off
early. The performance of an end-pointing system is thus
evaluated on how well this conflict is resolved in practice.
Canonically, voice activity detection (VAD) has been
widely used for speech end-pointing [5, 6]. In the VAD set-
ting, a model is developed to distinguish speech segments
from non-speech segments, which usually includes silence,
music, and background noises [7, 8, 9]. The end of the query
can then be determined if a fixed duration of silence is ob-
served by the VAD system. However, this approach is not
reliable enough as pointed out by later work [10, 11]. The
fact that a VAD system is typically not trained to distinguish
pauses within and at the end of queries prevents the system
from capturing enough acoustic cues related to end-of-query
detection, such as speaking rhythm or filler sounds [11].
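As a minimal illustration, the fixed-silence-duration rule described above can be sketched as follows. The frame rate, threshold value, and function names here are hypothetical choices for exposition, not taken from the cited works:

```python
def vad_endpoint(vad_is_speech, frame_ms=10, silence_ms=500):
    """Return the frame index at which a fixed-silence end-pointer
    would close the microphone, or None if it never triggers.

    vad_is_speech: per-frame booleans from a VAD (True = speech).
    silence_ms: trailing silence required to declare end-of-query.
    """
    needed = silence_ms // frame_ms  # consecutive non-speech frames required
    run = 0
    saw_speech = False
    for i, is_speech in enumerate(vad_is_speech):
        if is_speech:
            saw_speech = True
            run = 0  # any speech resets the silence counter
        elif saw_speech:
            run += 1
            if run >= needed:
                return i  # endpoint declared at this frame
    return None

# A mid-query pause shorter than the threshold (200 ms here) does not
# trigger an endpoint, but any pause >= silence_ms does -- even if the
# user is merely hesitating, which is exactly the weakness noted above.
frames = [True] * 30 + [False] * 20 + [True] * 10 + [False] * 60
print(vad_endpoint(frames, frame_ms=10, silence_ms=500))
```

Because the silence threshold is fixed, the only way to tolerate long hesitations is to raise it globally, which delays every response.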
Recent research on speech end-pointing has mostly focused
on classification methods, where a dedicated end-pointer is
trained to classify audio frames into end-of-query frames
and other frames [11, 12]. The success of the Long Short-Term
Memory (LSTM) [13] architecture has contributed to this
approach. Other efforts incorporate additional text decoding
features [14] or personalized user i-vectors [15] to better
adapt the end-pointer to specific acoustic environments. In an
end-to-end automatic speech recognition (ASR) system, the
end-pointer may also be jointly optimized with the recognition
model [16]. Despite these promising results, all of the above
works focus on binary detection of end-of-query with hard
labels (e.g. 0 and 1). In real scenarios, however, an end-pointer
should adjust its endpointing aggressiveness based on semantic,
prosodic, or other speaking patterns in the query. Traditional
binary classification targets are less flexible in this respect.
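To make the contrast concrete, the sketch below compares the two target schemes for a query whose speech ends at a given frame. The function names, shapes, and the linear ramp are our own illustrative assumptions, not the scheme of the cited works or of this paper:

```python
def hard_targets(num_frames, end_frame):
    """Binary classification targets: 1 for every frame at or after
    end-of-query, 0 before it -- the same label regardless of context."""
    return [1.0 if i >= end_frame else 0.0 for i in range(num_frames)]

def soft_targets(num_frames, end_frame, expected_pause_frames):
    """Regression-style soft targets that ramp up over an expected
    pause length, so a context where a short pause is expected can
    endpoint aggressively while a context where a longer pause is
    plausible (e.g. mid-sentence) delays the decision."""
    targets = []
    for i in range(num_frames):
        if i < end_frame:
            targets.append(0.0)
        else:
            # linear ramp reaching 1.0 after expected_pause_frames frames
            targets.append(min(1.0, (i - end_frame + 1) / expected_pause_frames))
    return targets

# Hard labels jump straight from 0 to 1 at the end frame; soft targets
# rise gradually, and the slope can vary with the query's context.
print(hard_targets(8, end_frame=5))
print(soft_targets(8, end_frame=5, expected_pause_frames=3))
```

With hard labels, every query implicitly shares one endpointing aggressiveness; with soft targets, the ramp length becomes a per-query knob that a regression loss can learn to predict.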
In this paper, we study a novel speech end-pointing strategy
based on a regression method. In this setup, an end-pointer
is optimized to fit soft-coded targets during training, where
the targets are set by considering the expected pause given
the semantic context of the query. Testing on both 14.4M
smartphone speech queries and 467K wearables user queries,
we show that our proposed method effectively reduces the
response delay of end-pointing while maintaining accuracy
comparable to the conventional classification-based method.
arXiv:2210.14252v1 [cs.SD] 25 Oct 2022